Hello everyone. I hope everyone is enjoying the conference, and welcome to our talk. We are going to talk about the challenges in monitoring a distributed storage system and how Tendrl addresses them. I am Nishabh Jain, my co-presenter is Gautam, and we are from Red Hat; we are open-source enthusiasts. Let's focus on the challenges first. Everyone in this room probably knows what software-defined storage is, but let me describe it briefly: software-defined storage is an architecture that separates the storage software from the underlying hardware. So, how many of you have actually worked on a distributed storage system? Great. And how many of you are monitoring or maintaining one? Okay, then I understand the problems that you face every day. As you can see in the picture, we are going to talk about a distributed storage system: there are multiple nodes that can communicate with each other, each node has multiple devices attached to it, and a software layer wraps up the complete storage system. The client talks directly to that software, the software provides the necessary resources, and the response goes back to the client. Okay, let's talk about monitoring first. What is monitoring? In simple words, monitoring is observing the system and its resources and, if a problem occurs, finding out exactly what the problem is. The most traditional way to monitor a node is through CLI commands. You might have used simple commands like top or vmstat, and with them you can find the CPU utilization, the rest of the resource utilization, and network-related issues. If you have a single instance, it's pretty easy to monitor that particular node. But we are talking about a distributed storage system, so what if you have multiple nodes?
If your cluster has four nodes, it is still easy enough to run the command-line tools and find out the resource utilization. But what if the number of nodes in the cluster increases? The complexity increases with it. And what if your system has a hundred nodes? At that point someone is going to get paranoid. So is this really the best way to monitor a distributed system? Let's say you are a super admin and you have found clever ways to monitor the system; are you going to sit in front of the computer 24/7 just to watch your cluster? I guess not. This is the big problem that we face in monitoring, but let's discuss another challenge now. In the architecture diagram earlier, everything was working perfectly. But what if something fails? Say the network has failed, or a device is failing; then any admin will literally go crazy and will have to debug the problem to find a solution. The best way right now to find the reason for a failure is the logs: thankfully, every SDS provides extensive logging through which one can debug the problem. But then we end up opening multiple terminals, one per node, trying to debug where exactly the problem occurred.
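What "running the command-line tools on every node" looks like in practice: a throwaway script that turns `vmstat`-style output into a single CPU utilization number, which you would still have to run by hand on each node. This is just a sketch; the sample output is made up, and only the column layout follows a typical `vmstat` report.

```python
# Hypothetical output of `vmstat 1 1` on one node; only the column
# layout (ending in: us sy id wa st) matches the real tool.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  23456 345678    0    0     1     2   30   60  5  2 92  1  0
"""

def cpu_utilization(vmstat_output):
    """Return CPU utilization (%) as 100 minus the 'id' (idle) column."""
    last = vmstat_output.strip().splitlines()[-1].split()
    idle = int(last[-3])  # columns end with: us sy id wa st
    return 100 - idle

print(cpu_utilization(SAMPLE))  # 8
```

Multiply this by a hundred nodes, plus memory, disk, and network, and the scaling problem above becomes obvious.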
And we end up in the same place again. So is this really the best way to debug a problem, going through the logs on all the nodes? The main challenge in a failure scenario is to know why the system failed, where it failed, and when it failed, and going through all the logs is not a good solution. So we have discussed three main challenges: the first was monitoring multiple nodes simultaneously, the second was monitoring all the nodes 24/7, and the third was finding the exact problem in a limited amount of time. Let's see how Tendrl addresses them. But first let me brief you a bit about Tendrl. Tendrl is an open-source monitoring tool for Gluster; it provides deep metrics and visualization of Gluster clusters, and it supports multiple clusters. But then, what is Gluster? Gluster is a software-defined storage system for distributed storage, and yes, it is open source. Okay, this is the Tendrl architecture. We split the architecture into two parts: the services which run on the server side and the services which run on the storage nodes. Based on that, we classified the services into three major layers. The layer highlighted in green is the storage node layer, the layer highlighted in blue is the middle layer, and the layer highlighted in red is the visualization layer. We will discuss each layer in detail. First, the storage node layer, which contains the three major services we run for monitoring. The first one is the node agent.
The node agent is an important service. Its main task is to detect whether the node has a Gluster file system and, if it does, to push the Gluster details into etcd. Next is the gluster-integration service. This interacts with Gluster using the Gluster CLI commands, identifies the Gluster topology, and pushes the details into etcd: the topology along with status details, like how many volumes there are, whether the volumes are up or down, whether the bricks are up or down, whether each peer in the cluster is connected or disconnected, and so on. Third is collectd. This is the monitoring service: it collects the monitoring metrics from the system resources, like CPU and memory, and from the Gluster cluster itself, and pushes all of those metrics into Graphite. That finishes the storage node layer; now we move to the second one, the monitoring layer. The monitoring layer has only one service, called monitoring-integration, and it has three major responsibilities. First, it uploads the monitoring dashboards into Grafana. Second, it fetches the Gluster topology from etcd and does some aggregation work, like: out of this many volumes, how many are up, and out of this many bricks, how many are up and how many are down. It performs these kinds of aggregation operations and pushes the results into Graphite. Third, it receives the alerts from Grafana and notifies the user. Okay, we have finished two layers; now we will discuss the last one, the visualization layer.
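The aggregation step just described can be sketched roughly like this. This is a minimal illustration, not Tendrl's actual code: the topology structure and metric names are assumptions, but the `path value timestamp` line format is Graphite's real plaintext protocol.

```python
import time

def aggregate_topology(volumes):
    """Count up/down volumes and bricks from a topology snapshot."""
    counts = {"volumes_up": 0, "volumes_down": 0,
              "bricks_up": 0, "bricks_down": 0}
    for vol in volumes:
        counts["volumes_up" if vol["state"] == "up" else "volumes_down"] += 1
        for brick in vol["bricks"]:
            counts["bricks_up" if brick["state"] == "up" else "bricks_down"] += 1
    return counts

def to_graphite_lines(cluster, counts, now=None):
    """Render counts in Graphite's plaintext protocol: 'path value timestamp'."""
    now = int(now if now is not None else time.time())
    return [f"clusters.{cluster}.{key} {value} {now}"
            for key, value in sorted(counts.items())]

# Hypothetical topology, as if fetched from etcd.
topology = [
    {"state": "up",   "bricks": [{"state": "up"}, {"state": "up"}]},
    {"state": "down", "bricks": [{"state": "down"}, {"state": "up"}]},
]
counts = aggregate_topology(topology)
print(counts)  # {'volumes_up': 1, 'volumes_down': 1, 'bricks_up': 3, 'bricks_down': 1}
print(to_graphite_lines("c1", counts, now=1700000000)[0])
```

In the real deployment those lines would be written to Graphite's carbon listener over TCP; here they are just printed.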
The visualization layer contains two major parts, both for visualizing the metrics and the topology. One is the web administration UI, which you can also call the Tendrl UI. It displays the Gluster topology in a way the user can understand: the user can drill down and easily identify how many volumes are in a cluster, which nodes are participating in the cluster, and so on. The second is the Grafana dashboard. We use Grafana for monitoring: it fetches the monitoring metrics from Graphite and displays them using graphs and panels. Let's dig a bit deeper into the visualization layer, and what better way to explain visualization than by seeing it. As you can see, this is the cluster dashboard, a Grafana dashboard. In it you can see the number of hosts, the number of bricks, and the number of volumes present in the particular cluster, and you can see the status of the cluster. Using this dashboard an admin can easily know what is working and what is not working in the system. Now let's suppose a device fails. Before moving forward, let me explain a bit about Gluster topology. A device is a physical entity, as we all know. What Gluster does is format the device, place a particular file system on it, and mount it; we call that a brick, and these bricks are used to create volumes. This is how things work in Gluster. So now let's get back to it: what happens when a device fails?
As you can see in the cluster dashboard, one of the bricks has stopped, and one of the volumes that was initially up has been marked as degraded. But this doesn't give you enough information about which brick failed or where that brick is placed. So let's look at the host dashboard. Using the host dashboard you can see on which exact host the brick was placed and how many bricks have actually failed, but it still doesn't tell you when the brick failed. So let's look at the brick dashboard; each time, you can drill down a bit deeper to find the right answer. In the brick dashboard you can see which host had which brick and, via the graphs, at what particular time the brick failed. We were facing three challenges: when the system failed, where the system failed, and why the system failed, the popular three W's. Let's see how Tendrl solves the problem. With these drill-downs you are now able to figure out where exactly the system failed, and the graphs are updated every 10 seconds.
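The "degraded" state shown in the dashboard follows directly from the brick states. Here is a minimal sketch of one plausible rule (the real logic lives inside Tendrl and Gluster, and may differ): a volume is up when all its bricks are up, down when all of them are down, and degraded in between.

```python
def volume_status(brick_states):
    """Derive a volume's health from the states of its bricks."""
    down = sum(1 for state in brick_states if state == "down")
    if down == 0:
        return "up"
    if down == len(brick_states):
        return "down"
    return "degraded"

# Hypothetical volume with one failed brick, as in the dashboard example.
print(volume_status(["up", "down", "up"]))  # degraded
```

This is exactly the aggregation that turns raw per-brick data into the single "degraded" badge an admin sees at the cluster level.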
So if there is any change in the graph, you can figure out the time when the system failed, and using that information you can filter the logs and find out the exact reason for the failure. As I said, we now have a great dashboard for monitoring all the nodes in our cluster, but are we still going to hire an admin to watch the system 24/7? That problem still persists. To solve it, Tendrl supports alerting. Whenever any abnormal behavior happens in the cluster, Tendrl automatically detects the problem and informs the user via alerts, and it also sends notifications to the user, for example over SMTP. Tendrl supports two types of alerts: status alerts and utilization alerts. Status alerts are raised from the storage node layer; in particular, the gluster-integration service raises them. These are alerts like the ones we saw in the dashboard: a brick is down, a volume is down, a volume is degraded, and so on. Here in the UI you can see a few status alerts that have been raised. The next type is the utilization alert.
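Utilization alerting of this kind reduces to comparing a metric against thresholds and updating, rather than duplicating, the alert when the severity changes. A minimal sketch follows; the 80% warning and 90% critical thresholds are assumptions chosen to mirror the demo later in the talk, not Tendrl's exact configuration.

```python
def classify(utilization_pct, warning=80.0, critical=90.0):
    """Map a utilization percentage to an alert severity."""
    if utilization_pct >= critical:
        return "critical"
    if utilization_pct >= warning:
        return "warning"
    return "info"

def update_alert(alerts, resource, utilization_pct):
    """Replace any existing alert for the resource instead of stacking duplicates."""
    alerts[resource] = classify(utilization_pct)
    return alerts

alerts = {}
update_alert(alerts, "brick:node1:/bricks/b1", 86)  # raises a warning alert
update_alert(alerts, "brick:node1:/bricks/b1", 96)  # escalates the same alert
print(alerts)  # {'brick:node1:/bricks/b1': 'critical'}
```

Keying alerts by resource is what gives the "the previous alert gets updated" behavior shown in the demo, instead of a flood of repeated notifications.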
Most of the utilization alerts are raised from the Grafana side. Tendrl automatically figures out the criticality of an alert and, based on that criticality, raises it as a warning alert, a critical alert, and so on. We will see through a demo how Tendrl actually raises the alerts, so let's watch the demo video. Okay, this is the basic Tendrl UI; this is just one page of it. Here you can see that we are monitoring one cluster, we have three hosts, and there are zero alerts right now. This is where the notifications are displayed; right now there are only info alerts and no major alerts. This is the cluster dashboard that we saw; everything is working fine right now. This is the cluster overview dashboard. Then this is the host dashboard, which gives you information about all the hosts and the bricks, like we saw in the slides. Now we'll look at the brick dashboard, where you get details about each brick. As you can see, capacity utilization is 1% for this particular brick, and if I change to another brick, the capacity utilization is 10%. Now let's fill some data into the volume and find out what happens when we actually cross the threshold. First I'll show you via `df -h` how much capacity there is. As you can see on all three nodes, which I have highlighted, only 2% of the brick capacity has been used. Now I'm going to put some data into the volume with the fallocate command; it's just dummy data. Then we'll see how the capacity of the volume increases and whether it is displayed in the dashboard. As you can see, on the third node and every other node the capacity has increased to 86%. Let's check the dashboards now.
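The `df -h` check in the demo can also be done programmatically; `os.statvfs` exposes the same block counts `df` reads. A sketch (POSIX-only, and it computes the percentage slightly more simply than `df`, which also accounts for root-reserved blocks):

```python
import os

def used_percent(path="/"):
    """Approximate filesystem capacity utilization for the given mount point."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize  # blocks available to unprivileged users
    return 100.0 * (total - free) / total

print(f"{used_percent('/'):.1f}% used")
```

A collector like collectd does essentially this on every brick mount point, on a timer, and ships the result to Graphite.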
Well, it's super fast; we got the results in less than 10 seconds, and now you can see on every node that the utilization has increased to 85.5%. Now let's look at the Tendrl UI: there are warning alerts in the notifications mentioning that the brick utilization on each node has crossed the particular threshold. So right now we have just got warning alerts; let's fill in more data and try to get critical alerts. As you can see, the usage percentage is now 96%, which is above 90%, our critical alert threshold. Again it is reflected in the dashboards, and in the Tendrl UI the warning alerts have been replaced by critical alerts. You don't get a new alert; the previous alert gets updated. And in the alerts section you can also see how many alerts there are in this particular cluster. Now let's remove the data and see whether the system recovers and reflects that, whether it basically clears the alerts or not. I'm removing all the data from all three nodes; as you can see, the capacity utilization dashboard is back to normal, and the critical alerts have become info alerts saying that there was a critical scenario but it has now been resolved. Okay, so that's all from our side. Any questions? [Audience question] Logs in the sense of what? Okay, yes, the question was whether I mean these logs or the logs of the particular SDS. I mean the logs of the SDS: Gluster is an SDS, a software-defined storage, and it logs its messages on each node.
Those are the logs I'm talking about. I did mention that every SDS has extensive logging capabilities, but if I want to find the cause of a failure, I have to go to every node and check the logs. Is that good enough? Does that answer the question? [Audience question] No, what I'm doing is giving you the exact node where the problem has occurred, so you can pinpoint it: you can go to that node and check the logs at that time. Yes, it's basically useful for filtering the logs. [Audience question] No, we don't do that; we don't push the logs anywhere. Okay, sorry. His question was: if you are pushing the logs to a central place, why not use rsyslog? We don't push the logs at all. This tool is made for Gluster, which logs separately on each node; if an SDS logs to a centralized store, then I guess it makes more sense to use something like rsyslog. [Audience question] No, sorry. Okay, the question was: if a particular failure occurs on multiple nodes, will the software do a log analysis on each node?
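The "filter the logs by node and time" workflow from the answer above can be sketched like this. The log lines and format here are invented for illustration; Gluster's actual log format differs.

```python
from datetime import datetime, timedelta

def around(lines, failure_time, window_minutes=5, fmt="%Y-%m-%d %H:%M:%S"):
    """Keep only log lines whose timestamp falls within +/- window of the failure."""
    lo = failure_time - timedelta(minutes=window_minutes)
    hi = failure_time + timedelta(minutes=window_minutes)
    return [line for line in lines
            if lo <= datetime.strptime(line[:19], fmt) <= hi]

# Made-up log lines from one node.
logs = [
    "2018-02-04 10:00:01 I [glusterd] volume vol1 started",
    "2018-02-04 10:41:30 E [posix] brick /bricks/b1 went offline",
    "2018-02-04 11:20:00 I [glusterd] heartbeat ok",
]
failure = datetime(2018, 2, 4, 10, 42)  # time read off the brick dashboard
print(around(logs, failure))  # only the 10:41:30 error line survives
```

Knowing the node and the failure time from the dashboards turns "grep a hundred machines" into a single five-minute window on one host.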
Tendrl is a monitoring tool as such, and it doesn't do any log analysis. Log analysis, meaning finding the reason for the failure, is done manually; we don't provide any such feature. What we do is give you the exact place and time where the failure occurred, so that you can filter the logs, possibly also by using some log aggregation tool like Kibana, I think. To add to that: Gluster actually supports Gluster events, so if something affects multiple nodes, a Gluster event is raised from all of those nodes. Tendrl captures that and highlights it to the user, so the user can easily see it. These are the events that Tendrl displays, over here, so it maintains a kind of history that you can track. Any other question? [Audience question] Okay, the question is whether we can manage the cluster with it, like creating a brick and so on. Actually, Tendrl is only for monitoring.
It's not for creating anything; it's a monitoring tool rather than a management tool right now. Yeah, go ahead. [Audience question] The question was which backend database we use for Grafana: we use Graphite. We store the data from Gluster in Graphite and then display it using Grafana. [Audience question] The question was whether the data that has been collected by Tendrl can be sent to some other monitoring tool. No; we provide a complete package as such, from the back end to the front end, and with this package you can monitor everything properly in Grafana, in one dashboard, with drill-downs as you have seen. But we push the data to Graphite, and since Graphite is an open-source tool, you can pull most of the data from it and use it for your own monitoring purposes. As such, though, that kind of integration is not provided by Tendrl.
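Pulling data out of Graphite, as suggested in that last answer, goes through Graphite's HTTP render API; building such a query is just URL construction. A small sketch, where the base URL and metric path are hypothetical but the `/render` endpoint and its `target`, `from`, and `format` parameters are Graphite's real API:

```python
from urllib.parse import urlencode

def render_url(base, target, frm="-10min", fmt="json"):
    """Build a Graphite /render query for one metric series."""
    query = urlencode({"target": target, "from": frm, "format": fmt})
    return f"{base}/render?{query}"

url = render_url("http://graphite.example.com",
                 "clusters.c1.bricks_down")  # hypothetical metric path
print(url)
```

Fetching that URL (with any HTTP client) returns the series as JSON, which another monitoring tool could then consume.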