Well hello everybody and welcome again to another OpenShift Commons briefing. This time I'm really pleased to have with us the folks from CoScale, Fred Rickbosch and Samuel Van Dam, who's been with us before, and they're going to talk about using some of their CoScale tools and services for proactive performance management of OpenShift and the underlying Kubernetes, and so I'm going to let them introduce themselves. The format of this session is: there's a chat, ask questions, and one of us will try to answer them. After the presentation and demo is done, we'll open it up for Q&A and follow-up questions. So please take it away, Samuel and Fred, looking forward to hearing more from you guys. Okay, thanks. Thanks a lot for the introduction. So I'm Samuel Van Dam, and as mentioned, I'm joined here today by Fred Rickbosch, our CoScale CTO. I'm very glad you could join us today, Fred. Happy to be here. Okay, so as mentioned, the topic of today's webinar is all around performance management of an OpenShift environment, and we really want to have a discussion on the performance considerations of running applications on OpenShift in production, and really focused on production. And especially, we want to put in a lot of time on how you can address these performance considerations with a container monitoring platform. Now, we will start with a couple of scenarios that we've seen happen at customers and what the effects are that you can see on the OpenShift environment. Now, I know that Fred has put in a lot of time this week to set up a very nice OpenShift environment. So, Fred, could you show us a little bit of what you've set up for us today? Sure. So let me go to the OpenShift UI for that. So as you can see here, we have a lot of things running. We have MySQL, Nginx, and so on running. But what this thing actually does is, it's a word count application. So if you go to this URL here, you can see how it works. 
So here, you can submit some words, and then you get some statistics about the words that were entered before. So you can see the most used words, the most entered words, and the most entered letters, actually. So it's a very simple application, very basic. However, somebody over-engineered this a little bit. So he put an Nginx first. Nginx sends traffic to the receiver, which puts things into RabbitMQ. Then there is a service that picks it up and puts it into MySQL. And then there are other services that process the data and put it into Redis. And that's served to the customer again. So we can see those services here, and we can see the workers below. So we have some workers for calculating letters, calculating the words, a processor, and so on. So this is what we're doing in this application. Of course, I also installed CoScale, and we're monitoring this environment. So here, you can see that we are monitoring OpenShift, but also all the things running on OpenShift. So Nginx, Java processors, RabbitMQ, everything I just mentioned. Okay, that looks actually, in my opinion, like a really cool application. But can you maybe explain, for the people that don't know OpenShift in detail, what happens in the background when you send a request to this application? What does OpenShift manage for us, and details like that? Sure. So for creating a route, I just click this Create Route button. So I created the route for Nginx. This means that the traffic going to this URL will be sent to my OpenShift nodes. Once the traffic arrives at my OpenShift node, the OpenShift proxy will have a look at what the URL was. And based on the URL, so if it sees that it is Nginx words dot this thing, then it will send it to the Nginx service. Then the Nginx service will receive that request. If we have a look at the Nginx service in the code, we can actually see that the Nginx service talks to the receiver. And here we provide an environment variable. 
So this is some PHP code for getting an environment variable. So using an environment variable, we set the receiver host. In our case on this cluster, the environment variable is just set to receiver. Because the receiver is in the same namespace as the Nginx, OpenShift will add it to the DNS, and Nginx will automatically be able to resolve the receiver. So it will know how to contact the receiver and how to talk to the other services on the cluster. Okay. Pretty cool. So would you consider this a microservice environment? Well, it depends on who you talk to, I think. So there are a lot of services in here, and they all have their own purpose, their single purpose. So in that aspect, it is a microservice environment. However, some people would say that the data has to be isolated per microservice. So if we look at the Redis, the Redis is being communicated to by three different services. So some people would say that violates a microservice architecture. So perhaps not for everyone, but I would consider this microservices. Okay. So it seems I still have something to learn. Now, when you're talking to customers of ours, do you see mostly that they're deploying these small microservices, or are some of our customers still deploying their old applications into these OpenShift environments? Yeah. So we are seeing both. Basically, some customers are doing completely new environments, very greenfield technology, using microservices to get things going, and they use OpenShift for that to scale it easily and so on. Others are coming from more legacy already, and they are putting their monolithic applications into OpenShift. And then they try to split off parts. They say, okay, this component looks very isolated. So we'll create a separate container for that component, and they will split it off. So we do see that some people take a more gradual approach, start with the monolith, and split it off while they are already running it on OpenShift. Okay, very cool. 
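The environment-variable lookup Fred described, with OpenShift DNS resolving the plain service name within a namespace, could be sketched in Python like this. The variable names here (RECEIVER_HOST, RECEIVER_PORT) are illustrative, not the demo's actual code, which is PHP:

```python
import os

def receiver_url(path="/words"):
    """Build the URL for the receiver service.

    Inside the cluster the default host "receiver" resolves via
    OpenShift's DNS, because the receiver service lives in the same
    namespace as Nginx. RECEIVER_HOST/RECEIVER_PORT are hypothetical
    variable names standing in for the one the demo's PHP reads.
    """
    host = os.environ.get("RECEIVER_HOST", "receiver")
    port = os.environ.get("RECEIVER_PORT", "8080")
    return "http://{}:{}{}".format(host, port, path)
```

The point is that the service never hard-codes an address: deploying the same image in a different namespace, or pointing it at an external receiver, is just a matter of setting the environment variable on the deployment.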
And what would you say is the big advantage of splitting everything up into small components like this? Yeah. So on the business side, most people are attracted to this because if you split everything into small components and the components are more isolated, you can iterate faster on those components. So you can make sure that if you have a new feature that you want to add to one component, you can add that fast without affecting the whole system. So you don't have to build the whole system again. It's a lot faster to get this into production. And from a technical perspective, they're really interesting because now these smaller services can be distributed across multiple nodes. You can scale them really easily, and it's more resilient against failures by that, because you have multiple instances running on multiple nodes. So you're saying it's more resilient against failures. Could you maybe show an example of that? How would that look in these environments? Sure. So let me go to a CoScale dashboard. So on this dashboard, we can see the free space on one of the nodes in the cluster. So I will mention this first. So I have a cluster of 10 nodes. I have two infra nodes. I have three masters, and I have five nodes. Here we can see the containers that are running for my words namespace. So I selected this namespace, and we can see the containers running for that namespace. Now, if you look at this graph here, you can see the free disk space on one of the nodes. You can see the disk was actually almost full. And then somebody started a process on this node that started filling up the disk. And at this point, we notice that there is an event. So we can see here that there is an event: the status of node one changed from "node has sufficient disk" to "node out of disk". This means that OpenShift will not schedule new pods on this node. 
However, what we also see here is that for node one, there are still containers running. So it's not because it went out of disk that OpenShift says, I have to remove all the pods from this node. So these are the type of things that you really want to get visibility into, right? You want to see what's going on with my underlying operating system, what events is this causing at the OpenShift level, and what is it doing to my containers? So in this case, it's not doing anything to our containers. Okay, that makes sense. But I can imagine that a node running out of disk, you probably have logs, you have other areas that might impact your service, so that the service looks fine, but it's actually not delivering any requests. Can we also see that? Yeah, so that's why you need in-container visibility. You have to have a look at the services that are running inside your containers, whether these are performing as you expected. Because when the disk is running full, it might have an impact on these services, and you want to be aware of that. You want to make sure that it's not only when things crash that you are notified; you want to know in advance. Okay, pretty cool. So that's a little bit of what OpenShift manages for you. If a node is in a bad state, it's going to make sure that if a container crashes, it goes to another machine. It pretty much makes sure that your applications keep running. So OpenShift will do everything it can to keep the pods running that you requested. Very cool. Maybe a little bit of a silly question, but how easy is it to scale out, let's say, the NGINX service that you have running? Okay, so let's go back to the interface. So here we can see the NGINX service. If I click this open, I can see the number of pods that are running. So at the moment, we have four pods running for the service. I can easily just click here and it will start scaling. 
If I look at the deployment, then if I'm fast enough, we'll see that the container is being created at that moment, and now it has started. So we actually went from four instances of NGINX to five with one click. Of course, you can also do this through the CLI so that you can automate this. Okay. Maybe in terms of the request, you just added an NGINX. What's the effect then from a monitoring point of view? So that's a very important question, I think. Your monitoring has to be aware of these things. So you have to make sure that your monitoring tool knows what is going on, right? So if we have a look here, we can see for the NGINX, I previously scaled it up from one container to three containers, and that is what we can see right here. So the yellow line shows that there was one container running at this point in time, and then at three o'clock in the afternoon, I scaled it up to three containers, three pods. But actually what happened, you can see here. So you can see this container kept on running. So the green area indicates where the container was running. At three o'clock, another one was started. Actually, two containers were started. The second one exited again, and OpenShift scheduled another one for me. So we killed this one, and OpenShift said, okay, you asked for three, so we will schedule another one for you. So you can really easily see in this graph what is going on: when our containers started, when they stopped, and how this is affecting my metrics. So if we go to a dashboard, we can actually see this. So this is my NGINX, and I can see here for the whole service that the CPU load dropped, the average CPU load for the whole service dropped here. So if we open this up, then we can see, okay, for the words namespace, there is a replica set NGINX. If I open this, I can see all of the pods for that namespace, for that replica set. 
So right here, I can see that we had one container running, multiple containers are joining, and the CPU load drops. This is, of course, because the load is load balanced. So the requests that were going to one container are now going to three containers. So you can see this for all these metrics. We can, of course, also have a look at this at the service level. So if I click through now on NGINX here, I can actually see some more in-depth NGINX metrics. So I can see the number of requests that are coming in. So if I click on the number of requests, then I can see that there is also a drop in the average number of requests that is being served. That looks strange, but it's very logical, right? So if we open this up, we can again see multiple containers joined, and they took over the requests. If I now stack this graph, then I can actually see that this is very normal behavior. I had about two requests per second before. We scale up the replica set, and we can see that there are three containers, and they are all serving equal traffic for that service. Yeah, that really makes sense, of course. Maybe a question. You gave us an example of a node actually failing, but I can imagine a lot of scenarios where, yeah, you want to do this yourself. I don't know, you need to maintain a machine, or you're seeing hardware errors on a specific one, and you think it's better to take it offline. What would be the steps you need to take to be able to do that? Yeah, so for example, when you require maintenance on your machines, this can happen, right? You have a security update that has to happen to the underlying operating system, and for that security update, you have to reboot. So at that point, you want to evacuate that node. You want to drain the node. You want to say to OpenShift, drain this node. It will evacuate all the pods, and it will reschedule them on different nodes. So I don't know where you can do this in the UI. 
That would be good information, but I know you can do it from the command line. So there is this oadm tool, the OpenShift administrator tool, and there you have options, and one of the options is drain. So here we can see "drain node in preparation for maintenance". So we can just say oadm and then drain, for example, my node 2. If I do this, yeah, I will get some warnings, so I have to say that I want to ignore the daemon sets. So let's do that. Now it's ignoring the daemon set, so the daemon set keeps running on that machine, and it's evacuating all of the other pods running there. So we can see that a receiver, an NGINX, and a Redis are being evicted, and new pods for these services will be started on different nodes. Yeah, that's pretty cool. And the advantage is really that because you have multiple of these pods, the traffic's being rerouted, and the user is not impacted. Yeah, that's what should happen, right? So let's have a look here at this dashboard. So this is a scenario I made before, also draining node 2. So here we can see that at this point in time, node 2 has two containers. It has a Redis container and a receiver container. For both of these services, actually, I created a graph. So we can see the number of successful requests to the receiver. So I know that the orange one is the one that is running on node 2, because I looked at that before. And I know there's only one Redis running, and it's this one. It's running on that node. If we now go forward in time, just a bit, then we'll see at 14:54, node 2 started draining. So this is at this point. And then some strange stuff starts happening right here. I'll explain this later. So what we can see here is that we have one Redis pod running. Then at this point in time, it's evacuated. So it stopped on node 2. Then it started on a different node. There's a bit of a gap here. 
That is because if you schedule it on a different node, and the image is not yet on that node, the image has to be pulled, and the container has to be started. So it can take a while. And the issue here actually is that there is only one Redis pod running. So we can see that if you only have one pod running, and the node that runs that pod fails, then you lose the service, of course. And that's also what's impacting us here. So we can see that the number of successful requests is very low in this area. And that is because there is no connection to Redis. So there is no Redis at that point, and things start failing. However, since OpenShift manages to get the pod up and running again, we can see here that requests are restored. And this yellow container is a new container that is scheduled on a different node. We can see here that node 2 is now empty. So there are no active containers on this node anymore, and the containers are rescheduled. So that's one of the things you need to watch out for when you're building these services: that you are ready for these kinds of events, that your software can handle this. Yeah, you have to think about how many of these things do I have to run? Do I have to run a cluster? How will my application handle this? If, for example, you have a Redis cluster and you're connected to one of the nodes, and that node fails, you want your client code to make sure that you connect to another node in the cluster and try the request again. So these retry kinds of mechanisms are also really important. And the cool thing with these tools is that you can actually see this behavior really easily. So you can actually see what's the impact of a node drain, what's the impact of a disk running full, and so on, on my application level. Yeah, nice. Now, it seems to work all pretty well. The OpenShift orchestrator, have you ever seen it mismanage a container? Put it on, let's say, the wrong machine, even if there is such a thing? 
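The retry-and-failover mechanism Fred just mentioned for client code could look roughly like the following sketch. This is not the demo's actual code; the node addresses and the helper name are hypothetical:

```python
import time

def with_failover(operation, nodes, retries_per_node=2, backoff=0.5):
    """Try operation(node) against each node in turn, retrying with a
    growing backoff, so that a single failed pod does not take the
    whole service down. `nodes` is a list of host:port strings."""
    if not nodes:
        raise ValueError("no nodes given")
    last_error = None
    for node in nodes:
        for attempt in range(retries_per_node):
            try:
                return operation(node)
            except ConnectionError as exc:
                # remember the failure and back off before retrying
                last_error = exc
                time.sleep(backoff * (attempt + 1))
    raise last_error
```

For example, if the first Redis node was just evicted during a drain, a call like `with_failover(ping, ["redis-1:6379", "redis-2:6379"])` would fail over to the second node instead of surfacing an error to the user.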
I wouldn't say mismanage. So, for example, if you want to make sure that your containers are not interfering with each other, you can set quotas. So you can set a quota for CPU usage and for memory usage on a container. And OpenShift will make sure that on one node, if you add up all the quotas, it doesn't go over the resource usage of the machine itself. So you can make sure that it doesn't schedule containers together that consume more memory than the machine has available. However, there are other things, like disk throughput and network throughput, on which you cannot set quotas. So this is a limitation of the Linux kernel, in which it is not possible to do these quotas today. So OpenShift can also not do it. So this means if you have one container that is very disk intensive, so it writes a lot of stuff to the disk, it can actually consume all of the bandwidth to the disk. And another container on the same node, if it also requires bandwidth to the disk, can experience problems from that. So it can be starving on throughput. So one container can affect another container. And these are the kinds of things that you want to see. So you want to know for all of your containers, okay, what CPU are they using, what memory are they using against their quota. But you also want to know what's the network traffic, what's the disk throughput, and you want to keep an eye on this. It's very important that you do this, that you have a historical view of this, so that you can see which containers can be scheduled together and which containers you should keep on different nodes, actually. So you can use mechanisms like node affinity and node labels to make sure that heavy containers are not scheduled with other containers. But you need data to come to those conclusions. Yeah, it makes sense. Maybe just to give everyone on the line an idea. So what are we seeing currently at CoScale? Are most of our customers using OpenShift in production? 
Or what's the range of environments that we currently monitor? Yeah, so we started off with seeing a lot of testing environments, seeing a lot of CI/CD environments, so people getting their feedback, trying out small projects. But actually, lately, a lot of people are starting to push heavily. So they have tested it on these QA environments, on these testing environments, and they're now ready to go into production. So we actually have customers that are using OpenShift quite heavily in production. So we're very glad that we can help them assure that their performance is good in production. Okay. And would you say that most of the applications are currently internal applications, or are you seeing external ones also? That's a good question. So most of the companies we talk to start with an internal application. So they have this internal application that they choose to test this with, where they say, okay, there's no real end-user impact if we do this. So they try it out with that. But now we are seeing the push for more. And once they gain experience with that, they start going to more customer-facing applications. And we see a lot of customer-facing stuff starting to happen right now. It's very exciting to see all this happening, of course. Now maybe coming back to OpenShift, I can imagine that there are some scenarios where a container is misbehaving or doing something it shouldn't, but that from OpenShift's point of view, this isn't really clear. So for example, can OpenShift handle every type of container issue, or only container crashes? So when a container crashes, OpenShift will help you, right? It will reschedule it and it will make sure that the pod gets back up and running. However, there are a lot of situations where the container is not crashing and health checks appear to be healthy. So OpenShift thinks, okay, this container is doing what it's supposed to do. 
But actually, if you look at other metrics inside of the container, so more performance metrics, what are the latencies of the requests that are being done, and so on, you can see that the services are suffering. So you actually need a tool that can provide visibility into this, that has a wide range of plugins, or can capture a wide range of metrics for all types of services. And that's what we do at CoScale. Okay, cool. So can you give us an example of an issue we detected or an issue that we saw? Sure. So this is the RabbitMQ still running in our test application. So we can see here some very global, very high-level metrics: the number of channels, the number of connections, consumers, exchanges, queues, and so on. Message rates, how many messages are coming in at the moment? What's the memory looking like? And we can see here that there's a strange trend going on, right? So at this point, everything is fine. So there are not a lot of messages in the queue. This queue is used like a job queue. So you put something on the queue and then somebody else will pick it up and process that data. But at this point, we can see that something is going wrong. So work starts piling up, or messages start piling up. So we can see the request rate goes up and these messages are not being handled. This causes the memory to go up. So this container starts consuming a lot more memory. It keeps on going up. And at some point, it will hit the limit and it will crash or restart or so on. At that point, of course, you will lose that data. In our case, the RabbitMQ is not persistent. So if it restarts, then we will lose that data. If I click here, then I can get some more detailed information. And I can actually see that there are two queues. So here we have the queue called junk. That contains a lot of messages. And then the messages queue, which is actually used by the application, is being cleared often. So the work is picked up and that goes okay. 
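The "work piling up" pattern in the RabbitMQ scenario, where publish rate exceeds consume rate and queue depth climbs until memory runs out, can be detected mechanically. Here is a toy sketch (not CoScale's detection logic) that flags a queue whose depth has risen across every recent sample:

```python
def backlog_growing(depths, min_samples=5):
    """Flag a queue whose depth rose in every one of the last
    `min_samples` readings -- the piling-up pattern from the RabbitMQ
    scenario. `depths` is a list of queue-depth readings, oldest first."""
    if len(depths) < min_samples:
        return False  # not enough data to call it a trend
    recent = depths[-min_samples:]
    return all(b > a for a, b in zip(recent, recent[1:]))
```

Catching the trend while the queue is still growing, rather than waiting for the broker's memory limit to kill the container, is exactly the point of this kind of check.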
So you can actually see here how to debug this. You can see, okay, there's strange behavior going on, the queue is filling up. And we can get down to the queue level: okay, it's this queue that is filling up. And then we can mitigate that. Okay. So these are large environments. I think when talking to customers, they start small, and they start scaling up after some time, after they have done their tests with the system. So I can imagine that you maybe start with four containers, but after a while, you have 20 containers running the same application. And it becomes, I think, very difficult to do something like this, right? How do you monitor 20 different RabbitMQs and pick up when they're not behaving as they're supposed to? Yeah, definitely. So you want to have good dashboards that provide you good visibility. But it's not possible to look at these dashboards all of the time. So having somebody dedicated going through all the dashboards all of the time, that's not something we are interested in. So we have an anomaly detection mechanism that will alert you when there are big changes in your system, things that you should actually look at. Okay. So how do you detect these changes? How long does it take, for example? Okay, so let's go to an example. Where's my example? Okay, there we go. So here's an example. So we can see, in this case, we are looking at the receiver. And we are looking at all the different containers that are running for our receiver. We can see that one of the containers is experiencing strange behavior. So it's going to 100% CPU. So perhaps it's in some kind of a loop it can't get out of. So it starts consuming a lot of CPU. CoScale will look at all of the containers in a certain service. It will see that, okay, these things normally look the same. They have the same type of behavior. As we can see here, it's very regular, the same kind of behavior. If something pops out of that behavior, then we'll alert users. 
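The peer-comparison idea Fred describes, containers in one service normally behave alike, so flag the one that deviates, can be illustrated with a toy leave-one-out z-score check. To be clear, this is not CoScale's actual model, just the underlying intuition:

```python
from statistics import mean, pstdev

def peer_outliers(samples, threshold=3.0):
    """Toy peer comparison: flag containers whose metric value deviates
    from their peers' mean by more than `threshold` standard deviations.
    `samples` maps container name -> latest metric value (e.g. CPU %)."""
    if len(samples) < 3:
        return []  # too few peers to compare against
    flagged = []
    for name, value in samples.items():
        peers = [v for n, v in samples.items() if n != name]
        mu, sigma = mean(peers), pstdev(peers)
        if sigma == 0:
            if value != mu:
                flagged.append(name)
        elif abs(value - mu) / sigma > threshold:
            flagged.append(name)
    return flagged
```

With four receiver pods around 12 to 14 percent CPU and one pinned at 100 percent, only the pinned pod sticks out against its peers, which is the kind of anomaly the webinar's example highlights in pink.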
So for this one, we can detect it within one minute. We can say, okay, there's something strange going on, especially because it's a very large anomaly. So the thing you can see highlighted here in pink, that's an automatically detected anomaly from CoScale. Okay, so that's very cool. And that's running on, or that's checking, everything? So is it checking the nodes, the containers, the orchestrator? Yeah, exactly. So we have metrics for the operating system, the orchestrator, the containers, and of course, the applications inside the containers. And for all of those metrics, we make models, and we make sure that the models are calibrated by the data that's coming in. And if new data comes in and it is an anomaly, you will get alerted for that. Okay. And so I can also receive emails for this then? Yes. Okay, pretty cool. Now, yeah, I think all the data we've been showing has been coming from CoScale, which is pretty clear, I think. Now, I think the people on the webinar will probably be interested in, okay, how do I install this? How do I monitor my own OpenShift environment? How long will it take? Yeah. So if we go to data sources, we can see we have an agent. And at this point, the agent is installed on all of the 10 nodes in the cluster. Let's say I was starting fresh, and I wanted to recreate this thing. Then what do I do? I create a new CoScale agent. In this case, I will deploy it as a container. I will deploy it on OpenShift. I will talk about how to monitor the images, the containers inside your environment, later on. So we'll skip that step. I can give it a name. We'll give it a name. And then we get install instructions. So let me first scroll down a little bit. So here we can see the step where the CoScale agent is actually deployed. And we can see that we are using a daemon set. So a daemon set is a mechanism in OpenShift that deploys a certain container on all of the nodes in your cluster. 
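As a rough idea of what such a daemon set manifest looks like: the sketch below is illustrative only, with hypothetical names and image, not the manifest the CoScale install script actually generates. It shows the pieces discussed next: the host Docker socket mount and privileged mode.

```yaml
# Hypothetical sketch of a monitoring-agent DaemonSet.
apiVersion: extensions/v1beta1   # the DaemonSet API group of that era
kind: DaemonSet
metadata:
  name: monitoring-agent         # illustrative name
  namespace: coscale
spec:
  template:
    metadata:
      labels:
        app: monitoring-agent
    spec:
      serviceAccountName: coscale-agent   # account granted the privileged SCC
      containers:
      - name: agent
        image: example/agent:latest       # placeholder image
        securityContext:
          privileged: true                # needed to read the host's /proc
        volumeMounts:
        - name: docker-socket
          mountPath: /var/run/docker.sock # lets the agent query Docker
      volumes:
      - name: docker-socket
        hostPath:
          path: /var/run/docker.sock
```

Because the pod template is identical on every node, the scheduler runs exactly one copy per node, which is why adding a node later needs no extra installation step.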
So this way, the CoScale agent container is running on all of the nodes in your cluster. So you can see here that we are mounting the Docker socket and some other stuff to make sure that we can get metrics from Docker, and that we can get metrics from the underlying operating system. Another thing to notice here is that we are using privileged mode. So in order to read metrics from the underlying operating system, we have to have privileged mode to read /proc. And in OpenShift, you have to do some stuff to make sure that privileged mode works. To make it easy for our customers, we actually include the instructions to do that as well here. So in the first step, we set up a security context constraint, which allows us to run a privileged container. You can also see the other things that we are using. You can review that and give this to your security team to see whether they like this. Then here we create a new project, a CoScale project, on your cluster. We create a service account for it. We add the security context to that service account. And then we deploy the daemon set. So the installation is actually as simple as copying this and pasting it into your terminal. So we can just do this right here. Of course, this was already done, but you see how that works, right? And that's the whole installation. So what, five minutes, 10 minutes to set this up? Depending, of course. Yes. Okay, pretty cool. And then let's say there is a new node added, or we change the configuration. Do I need to reinstall the agent every time? No, because we're using a daemon set. If you're adding a node to the cluster, OpenShift knows that there is a new node in the cluster. It will automatically also run the daemon set, the container for the daemon set, on that node. That's very cool. So we're using OpenShift itself to deploy the monitoring, pretty much. Yes. Nice. So now let's say for a basic install. So we've deployed this daemon set. We haven't done that much configuration, I think. 
What do I get out of here? What is the information that I can see? Yeah. So let's go to the home dashboard. When you do the installation with the daemon set, you will see the resources from the operating system, the metrics from Docker, and OpenShift. There's a bit more configuration required to get the other metrics. We'll talk about that right away. But I can show you what is already in the OpenShift dashboard. So a lot of the widgets that I have been showing are present on these default dashboards. So if you install it, you get these dashboards immediately. So you can see how many containers are running, how many nodes do I have, how many replication controllers, services, and so on. I can see the events that are happening in my cluster and some more information about builds, deployments, and so on. The cool thing is that this dashboard gives me a high-level view. It's very high level, but I can click through on this. So I can click through on the nodes and get this kind of view, the same thing as we saw before. We can click on one of the containers here and go into that dashboard. So we can see that here the container was running. I can zoom in to that. I can see the events for that container. I can also click through to other technologies. So there are a lot of these dashboards. So for example, for replication controllers, we have different types of dashboards and so on. So it's very purposefully made for these types of environments. Yeah, I guess when talking to customers, you put a lot of the knowledge you build into the dashboards again. Yes, exactly. Okay, now I noticed on top, and maybe some other people also noticed, these drop-downs. You have replication controller and namespace in this case. Can you maybe explain a little bit what that is, or what that does? Yeah, sure. Let's go back to the node dashboard first. So for our nodes, we can see that here I have selected the words namespace. 
So I can see all of the containers that are running for that namespace. If I now click a different namespace, for example co-scale, in which the CoScale agent is running, I can see those. We can also filter, so we can find a certain service. So I can look for my Redis to see where that is running, and so on. Yeah, it's pretty cool. There's more, however. So if we have a look at this dashboard, so this is a service metric dashboard, this is also a default dashboard that you get out of the box. You can see the average CPU, memory, and network traffic for all of your servers, which might not be that useful, but you can open this up and actually drill down so you can see all of your servers. But I can also go to Kubernetes, for example, and there I can see there are nodes; there are master nodes and regular nodes. So I can select one of the masters, or I can select one of the other nodes. So if I split this up for all nodes, then I can see per node what the behavior is. So if I click on this one, I would see which node that is; I can then pick out that node to inspect it further. There are two other dimensions here. So we can also see that there is a disk dimension and an interface dimension. So the interface is the network interface. I can see that at the moment we're taking the average over all interfaces, but I can split it across the different interfaces on the machine. So if I'm interested in, for example, the network traffic that is going between the nodes, or publicly, I can click on that interface and see data for that. Same thing for the other interfaces, of course. So this allows you again to start from a very high level and then drill down on certain aspects. Okay, cool. Yeah, I think, or maybe a quick question on this before I forget. So this system really means I can create a dashboard, for example, for my application. 
And if I have a development namespace, a staging namespace, and a production namespace, I can quickly compare the performance between the three, so I don't need to create three dashboards to see the same information. Yeah, exactly. It makes it easy to create a general dashboard for which you can select different nodes or different queues or what have you. Okay, cool. Probably a little bit of a silly question again, but can I alert on this? I see the first tab here is alerts. Oh yeah, of course. Okay, so let's have a look at that. By default, there are some alerts that have been set by CoScale: the average CPU load, free disk space, free memory, and so on. If I click on one, so let's say the free disk space, I can edit the alert, and it's very readable. It says: if free disk space in percent is less than 10% for five minutes for the servers. At this point it's for all servers, but I can actually select just a number of servers to which I want to apply this alert rule. For containers, you can also do it for a certain set of containers. So if you want to alert on the memory used by a Docker container, for example, I can go here, type memory, and find the memory used for Docker. If that is greater than a certain value, five gigabytes in this case, for five minutes, I can then set it on a certain container. I can do it on the image, so I can say: do it for all the containers that are running the Nginx image. But I can also do it on a more granular level: for a certain replica set only, for the whole deployment, or for services, and so on. So it's very modular. You can do it for one namespace or for all namespaces; it's very easy to create alerts for a very specific thing. You can also see that I'm not selecting individual containers here, although if you drill down you will see actual containers.
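A threshold rule like the "free disk space below 10% for five minutes" alert described here boils down to checking a trailing window of metric samples. The sketch below is purely conceptual, not CoScale's implementation; the sample format and evaluation logic are assumptions made for illustration:

```python
def alert_fires(samples, threshold, duration_s):
    """Conceptual threshold alert: fire when every sample in the trailing
    duration_s-second window breaches the threshold (value < threshold,
    as in 'free disk space in percent is less than 10% for five minutes').

    samples: list of (unix_timestamp, value) pairs, ascending by time.
    """
    if not samples:
        return False
    cutoff = samples[-1][0] - duration_s
    window = [value for ts, value in samples if ts >= cutoff]
    # Only fire if we actually have data spanning the whole window,
    # otherwise a single low reading would trigger immediately.
    covered = samples[0][0] <= cutoff
    return covered and all(value < threshold for value in window)
```

For example, with one sample per minute over ten minutes where free disk drops to 5% after minute five, a "less than 10% for 5 minutes" rule fires, while a "for 10 minutes" rule does not, because the early healthy samples are still inside the window.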
Drilling down to individual containers is usually not that relevant, though, because containers come and go a lot, so you want to alert at more of an aggregate level. Okay, that makes sense. So with the alert system you can send an email, I guess, and you probably integrate with Slack because it's very popular. But can I take actions from this? Like, what if I want to, I know this sometimes happens, and I want to take some more debug steps? How would I do this? Interesting that you ask this. I have a very good example of it. So here we have high memory on the calc-letters deployment. As I showed you before, this one says: if Docker memory usage in bytes is greater than 200 megabytes for anything in the calc-letters deployment, then I want to trigger an alert. And we can see here that it actually goes up to a lot more than 200 megabytes. For that, I will set a certain rule. The rule here is that a webhook is executed. You can see this URL here: it's webhook-words dot this IP address, and then /debug/heapdump. Here in the body we can see that a token is provided, and an action is given. The action will be filled in by CoScale; we can read that right here. The action is either triggered, acknowledged, or resolved, so the webhook will be sent for an alert that is being triggered right now or an alert that is being resolved right now. And the server field contains the server that is affected by this alert. Makes sense. So this information will be sent to that URL. If we have a look at our OpenShift dashboard, we can see that this webhook service is also running in our OpenShift deployment; that's this one right here. I can show you the code. It's a really simple Python program. What it does is expose a route, /debug/heapdump. It checks the token. This is a very basic form of security, right? It should be over HTTPS, and it should check the IP range of the CoScale servers, and so on.
So don't use this in production. But this is a very simple example with a very simple security mechanism. It checks the token first, then it checks whether the server and the action are present. If the action is triggered, so if the alert is created right now, then we extract the pod name from the server name. We can do that with a regular expression to get the pod name, and then we take a heap dump for that pod. So what is happening? In our system, we see that the memory usage is growing for a certain container. At some point, the alert gets triggered, and we say: okay, trigger this webhook that will take a heap dump. That heap dump can then later be analyzed to see which objects are consuming the most space, and you can actually optimize your service with that. If you look at the take-heap-dump method, the thing that is being executed: we take a dump, we upload the dump, and then we do a cleanup. Taking a dump is done using jmap; this is a Java utility for creating a heap dump from your JVM. We have to fill in the Java process ID, so we use this command right here for that. We use curl to upload it to a certain FTP server, and then we remove the heap dump from the container. This command is executed using kubectl: we do kubectl exec in the pod that was provided by the alert. So the alert provides the container that is having the problem, and we use that here to execute a certain command inside of that container, get the heap dump, and put it onto an FTP server. Does that make sense? I don't know that much about Java, but I guess people that work with Java every day should be pretty excited about this. Okay, cool. So you can take actions. Now, I remember you showing us a little bit of application data. The people online will probably be pretty interested in: okay, how do you get that in-container data into CoScale? How do you monitor the applications in the container?
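The webhook handler Fred walks through could look roughly like the following. This is a hedged sketch, not the actual demo code: the token value, the server-name format parsed by the regular expression, and the exact jmap invocation are all assumptions, and the FTP upload and cleanup steps are omitted for brevity:

```python
import re
import subprocess

TOKEN = "s3cr3t"  # shared secret checked on every request (hypothetical value)

# Hypothetical server-name format sent by the alert, e.g. "pod:calc-letters-3-abcde"
POD_RE = re.compile(r"pod:([\w.-]+)")


def extract_pod(server_name):
    """Pull the pod name out of the server identifier with a regular expression."""
    match = POD_RE.search(server_name)
    return match.group(1) if match else None


def heapdump_cmd(pod):
    """Build the kubectl exec command that runs jmap inside the affected pod.
    `pgrep -o java` picks the oldest Java process as a simple PID heuristic."""
    return ["kubectl", "exec", pod, "--", "sh", "-c",
            "jmap -dump:live,format=b,file=/tmp/heap.bin $(pgrep -o java)"]


def handle_alert(payload):
    """Minimal webhook handler: validate the token, then act only on 'triggered'
    (acknowledged and resolved need no action)."""
    if payload.get("token") != TOKEN:
        return "forbidden"
    if payload.get("action") != "triggered":
        return "ignored"
    pod = extract_pod(payload.get("server", ""))
    if pod is None:
        return "no-pod"
    subprocess.run(heapdump_cmd(pod), check=True)  # take the heap dump
    return "dumped"
```

In a real deployment this would sit behind an HTTP route such as /debug/heapdump, run over HTTPS, and also upload the dump and clean up afterwards, as described above.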
Okay, so let me go back to our agent page. The step I just skipped was this step, where you can define the images. So right here, I have defined that I want to monitor the Redis image, with any tag, with the Redis plugin. If I click edit, I can see how the Redis plugin is configured: it is using a certain connection, localhost and a port, and it's doing an active check. So this is how I configured my Redis monitoring for this image. Maybe go back to the connection for a sec. If I'm running on OpenShift, I know there's a system where I can automatically generate a password for a service. Would I need to configure this in CoScale for each of these services? So actually, you can provide environment variables, like this. You can just put in the environment variable; we will detect that you provided an environment variable here, we'll see that the running containers have that environment variable, and we'll fill it in at runtime. So there's no need to set a fixed password and username on your containers. You can still do that using the native mechanisms, using the environment variables, using the config maps, and then here you can just use the environment variables. So yeah, you can do it on images. Is there another way? Yeah. If we have a look here, we can see that there's only one thing here, only Redis. But I've been showing you a lot of dashboards where we're monitoring, for example, Nginx and MySQL and RabbitMQ. So what is the trick that we are using there? You can see here that there's another button, generate Docker labels. You can actually configure a plugin. So let's do the same thing as before for Redis: let's configure a Redis plugin with just the defaults. And then I will get a label. This is a Docker label, a label that I can put into my Dockerfile.
And whenever a container is started and the image has that label, CoScale will pick it up and will start monitoring as defined by the label. This means that as a developer, you can set how your container should be monitored. So for example, if you change something in your container, if you, for example, add a new metric to your JMX metrics, you can just add it here in the label, add the label to your Docker container, and the metric will be automatically picked up when the container is started. There's no need to go to the operations team and ask them to add this metric for this container. You can do it yourself: you add the label on your container, and then things will be started automatically. Okay, cool. I can show an example of this for our Nginx container, for example. Nice. So this is the Nginx container that we are using. Pretty familiar. Yeah. So we just took a random one from GitHub, sorry, Docker Hub. Carefully selected. Yes. And there we copy our own files and configuration into it, and then we add the label. In this label, we can see that we will get information from the access log file and from the stats interface. Okay, but I'm seeing localhost there. What does that mean? Is that localhost on the host that's running the containers? That's actually a good question. The way CoScale starts the plugins is that the agent running on each of the nodes starts a plugin inside the namespace of the container. That means these plugins can see everything that is local to the container, so they can use localhost, localhost:8000, as seen from within the container. You actually don't even have to expose this port. This is a port that only exposes the status interface; it doesn't have to be exposed as a service or anything like that. It can remain local inside of your container and we will still be able to access it.
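To make the label mechanism concrete: conceptually, the agent inspects each running container's labels and, when it finds the monitoring label, parses the plugin configuration out of it. The sketch below is purely illustrative; the label key, the JSON value format, and the plugin schema are invented here and will differ from CoScale's real generated label:

```python
import json

# Hypothetical label key; the real label name and format are generated by the
# "generate Docker labels" button and may look quite different.
LABEL_KEY = "com.example.monitoring"


def plugins_from_labels(labels):
    """Given a container's labels (as returned by `docker inspect`), return the
    monitoring plugin configs encoded under our label key, or [] if absent."""
    raw = labels.get(LABEL_KEY)
    if raw is None:
        return []
    return json.loads(raw)  # e.g. a list of {"plugin": ..., "config": ...}


labels = {
    "maintainer": "someone",
    LABEL_KEY: json.dumps([{
        "plugin": "nginx",
        # localhost as seen from inside the container's network namespace
        "config": {"status_url": "http://localhost:8000/status"},
    }]),
}
print(plugins_from_labels(labels)[0]["plugin"])  # prints "nginx"
```

The point of the design is that the monitoring configuration travels with the image: any environment that runs a container from that image gets the same monitoring automatically, with no agent-side changes.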
Same thing for this access log file. Normally you would put this to standard out and get a metric from there; we also support that, so you can put /dev/stdout here. But if you have a container that does logging inside of the container, we can manage that too: we can actually get to the file inside of the container, no need to mount it anywhere. So it's all very transparent. You can reason from the point of view of your container. Okay, very cool. What would this mean if I have a development, a staging, and a production OpenShift cluster? Would I need to do something on each of these to get them monitored by CoScale, or just add the label? Yes, you just add the label on the container, and if the container is running on the first environment, it will be picked up there and monitoring will start. Same thing for all the other environments. So there's no change between the different environments. You can actually test your monitoring on your staging environment, see whether everything works perfectly there, and then move it into production, and you will know that the configuration of your monitoring will be exactly the same. Cool. And then does it matter if I have one or two or a hundred containers of this Nginx running? No, CoScale will automatically detect the containers that are running and start monitoring those automatically. And I can imagine that if I have a hundred containers, this is going to have an impact, right? The monitoring is going to take up some resources. What can I expect in this situation? Yeah. So we pride ourselves on being a very lightweight monitoring solution. We make sure that everything is running very efficiently, that we don't put a lot of burden on your containers and on your hosts. This is really important because we're seeing a lot of containers running right now. It's not like you have one process per machine and you can add 10% overhead there because there's only one process, right?
Right now you see, for example, 10 containers running on one node; if you add 5 or 10% per container, then that's a huge amount. So you don't want to do that. We keep our overhead per node limited to 1%. Well, okay. I think that was a lot of information. Thanks for that. So is there anything else you want to show us near the end? Make sure you leave a little time for the Q&A, because we've answered some of the questions in chat, but there are a few extras here. Yeah, that's the last question from me. I don't have anything special to show you. Okay. Yeah, we did get a lot of questions. I'm going to just quickly take a look. One that I'm curious about, Jonathan's asking too, is: where is the CoScale UI web console running, and is there a template for it? So do you mean the web interface? Where are you actually running that? So we have multiple environments: we have an environment in the US and an environment in the EU. So it's SaaS-based, but you can also get this on-premise. You've answered a couple of questions. Maybe, Samuel, take a look at the chat, because I think you covered a number of the questions that people were asking here. Yeah, I think we have one difficult question, a special one, from Jean-François. He's asking if it's possible to run CoScale in an environment where you don't have admin privileges. I think that's a difficult one. So we can run in some kind of degraded mode. If you don't have privileged mode, for example, the agents will still work and the plugins will still work, but some things are shielded; for example, I think the disk metrics are shielded in the proc filesystem, so we cannot get those without privileged mode. But the plugins will work and you will still gather data; only those specific metrics won't be available. Yeah, just a little bit more limited than normal. Let's see. A couple of other questions popped in here. I do think you've answered most of them.
So if someone doesn't have privileged mode, what sort of changes do they have to make in the installation? Is there anything special, to go a little bit further with Jean-François's question? Yeah, so in that case, the command that you get for installing has to change a little bit. Right here, it tells you to use privileged mode. If you just leave that out, you will be able to deploy it in an environment where you don't have privileged mode, and you will get the degraded scenario. And the reason I'm curious about this is because I get a lot of people asking about monitoring solutions to use with OpenShift Dedicated or elsewhere-hosted OpenShift deployments. So I think I hear you saying that we could use CoScale if we were using someone else's hosted environment. Is that true? Yeah, in most cases that would work, yes. That's pretty cool. That would be very handy; a lot of people are asking for different monitoring tools, not just for Dedicated and Online. My experience with services similar to yours is pretty limited to using something like New Relic to hack on and debug my own applications, not the operational level that you're showing with Kubernetes. This is a pretty stunningly deep and operations-focused piece. But I think for those of us who are writing applications, this is a great way to really see the impact on the nodes and the use of the underpinning infrastructure that you might not think about when you write overly complex applications sometimes. Yeah, definitely. Pretty cool. What else is there in the questions here? Are there any other questions that we haven't answered? We've answered that it can be self-hosted, and that's good. And I think we've answered every question here. Can you put your final slide up with how to contact you guys?
That way, if anyone has any further questions, or anyone watching this video later has a question, that would be a great place to reach out and get a hold of them. There's one more: can we get a view or report of microservices which have not been in use for a longer time? Yes, definitely. So the monitoring starts at the moment the service is started up. If your services are very short-lived, we'll actually start monitoring them at the moment they are started, and when they stop or die, the monitoring stops for them. We also take a lot of scaling considerations into account for this: if you have a lot of these jobs, we make sure that it's still possible to visualize them and see them over time, because you can get a lot of containers in that case. The person who's asking is Deviote, and I'm wondering if he wants to just get unmuted so he can ask his question directly. If you can find him and do that, he can ask a follow-up. Hi, this is Deviote Guha. Yeah, go for it. Am I audible? Yes, I can hear you. Yeah, my question is more specific to the services that we are going to host on containers. Now, because of the new cloud-native architecture and the various microservices that we are going to build on these containers, there is a possibility that in many cases a lot of the microservices or applications that are developed may not be in use. Is there any possibility that after a certain time, we can find out which microservices, which containers, are not in use at all, so we can probably decommission those containers or microservices? From that perspective, I would like to understand. Yeah, definitely. It's a very good question. It comes back a bit to our webhook example. If you have a look, it's difficult to do this with OpenShift alone, because judging from CPU usage, memory usage, and so on, it's difficult to see whether a container is active or not.
However, because we do in-container monitoring, we can see, for example, whether there are requests coming into these containers. So if you have microservices and you see that there are no requests for a certain period of time, you could trigger a webhook with that to scale down that particular service. So that's definitely something that is possible. Thanks. I have a couple more questions. If you have time, then probably we can, or maybe I can drop my queries. Ask away, and let's see if we can get them snuck in here. I have a next meeting in two minutes, actually. Okay. I would be glad to take your questions offline, so you can drop me an email and we can set up a web conference. That would be great. Awesome, folks. All right. Well, we won't keep you guys too much longer, because we did fill up that entire hour, but it was a spectacular show, and I'm so pleased that you didn't use slides through the entire thing. So thank you very much, because it was really useful information. And I know you guys have trial capabilities too. So, folks on the call, if you want to give it a trial, check it out. This is a really great service, and hopefully we can use it to gain some insights into our OpenShift deployments. So thanks, Samuel, and thanks, Fred. And for anyone who would like to reach out, this podcast will be up online at the blog.OpenShift.com site shortly, and we'll also put it up on our YouTube channel. So thanks again, guys, and have a great evening over there in Belgium. Thank you, Diane. Bye. Take care. Bye, everyone.