I work in the Platforms and Services team at Capuleti, and I'll be presenting a talk on real-time distributed monitoring and alerting using the open source tools Nagios and Ganglia. When we started off as a company, we had a small one- or two-server setup and a LAMP-stack model; everything happened in the same place. As the company grew, the modules moved out into a more distributed setup and got scattered across different servers in different places, which becomes difficult to manage. So monitoring becomes very important, because without it we cannot know whether something is going wrong, what is going wrong, or, if we have to scale in certain dimensions, how to go about that.

Ganglia is an open source tool that collects data from different servers and makes it available at a central location. This is an interface for Ganglia which shows a few system and application level metrics. For example, this is the data collected for the application server, the MySQL server, our services, and the memcached server over a period of a week. And this is the load utilization in that cluster: it turns red when load utilization is at its maximum and green when the load is at a comfortable level.

Ganglia is a scalable distributed monitoring system. It uses pre-existing, proven tools and technologies and builds upon them: XML, XDR, and RRDtool. It uses XML for all the data queries that happen everywhere, XDR for compact, portable data transport, and RRDtool for storing the data on disk. It has many stable and active deployments across the world, at big sites like NVIDIA and Twitter.

So, what are some of the good features you'll find in Ganglia? First of all, it's very easy to configure and very easy to integrate with other software. We do not need to change configuration every time a new server gets added to the cluster or a new metric gets added to a server, which gets very hectic with tools like Nagios that need to be reconfigured each time. Other than that, there are lots of open source plugins available which plug directly into most of the open source software that we use: MySQL, Apache, Tomcat, and a lot more. And even if that does not meet our requirements, adding metrics is very easy: you can write a simple Python module, or you can call an executable from any program and send the values. The data store is redundant; data is stored at multiple locations, wherever you choose in the design. The monitoring daemon is very lightweight: it consumes less than 10 MB of RAM and almost no CPU. It effectively monitors both system and application level metrics, has plugins for many application level metrics, and can be extended to include custom application level metrics for our own software. Importantly, all of that data is collected centrally and made available at one location, and sense can be made of that data using different tools.

Can we develop and plug in more? Yes, we can develop plugins in that direction. What Ganglia does is collect metrics. Metrics are quantities which have numerically representable values. So if you can break down your logs into metrics, into counts with numerically representable values, you can easily write a script or plugin that does whatever calculation on your logs you want and, in the final step, sends the value out to Ganglia. It's as easy as executing a command with a metric name and a value.
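To make that last point concrete, here is a minimal sketch of such a log-to-metric script. The log path, metric name, and error pattern are invented for the example; the gmetric flags shown are the common ones, but check them against your installed gmetric.

    #!/usr/bin/env python
    # Hypothetical example: count ERROR lines in an application log and push
    # the count into Ganglia by shelling out to the gmetric executable.
    import subprocess

    LOG_FILE = "/var/log/myapp/app.log"   # made-up path for illustration

    def count_errors(path):
        # Any calculation over the log can go here; a simple line count will do.
        with open(path) as f:
            return sum(1 for line in f if "ERROR" in line)

    if __name__ == "__main__":
        value = count_errors(LOG_FILE)
        # Sending the metric is just one command: a name and a value.
        subprocess.check_call([
            "gmetric",
            "--name=app_error_count",
            "--value=%d" % value,
            "--type=uint32",
            "--units=count",
        ])

A script like this can be run from cron, or called ad hoc whenever an interesting event happens.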
This is the internal working diagram of Ganglia. It has three major components: gmetad, gmond, and the web front-end. The gmond daemons are the monitoring nodes which run on all the servers we want to monitor. They are very light; they only collect data and send it to gmetad. gmetad collects and aggregates all the data and stores it on disk using RRDtool. The web front-end accesses the gmetad data and makes it accessible in the browser, with graphs. Why is it called distributed monitoring? The monitoring collects data across the distributed system and makes it centrally available; after that, analysing the data and everything else can be done centrally. It is distributed in the sense that collection is distributed.

Now, a bit deeper into Ganglia's internals. The two major pathways for sending metrics to Ganglia are the gmond service and gmetric. The gmond service has its own internal scheduler which calls the plugins. There are core metrics, the usual system metrics, which are collected by default. Custom plugins can be added to extend the functionality and add new metrics. The scheduler will call the custom plugin, and you can define your own function in it to implement whatever logic you want; the scheduling itself is done by gmond's scheduler, which collects at different intervals and different thresholds across the different plugins and sends the values out directly via UDP. It can send UDP both multicast and unicast, but on most cloud platforms UDP multicast doesn't work, so UDP unicast is what we use. It picks up its configuration from the /etc/ganglia/gmond.conf file; that one file holds everything needed to configure gmond. It's a simple, single-file configuration.

Is this installed on every client? Yes, this gmond collection daemon is installed on every client we wish to collect from. How does it compare with Nagios? It does some of the things Nagios does and some things it doesn't, but it is different in its architecture and scales very nicely compared to Nagios.

The next mode of sending data to Ganglia is the gmetric executable. It's a simple binary executable which any program can call, giving the name of a metric and the value for that metric, and the metric will immediately show up in the whole system of metrics. So we can have asynchronous data sources which just call gmetric. For some data sources we won't know beforehand how often they will fire, say once an hour or so, for example when some long-running event happens. Those can't easily be included in a scheduled plugin, but when the event happens the gmetric executable can be called with the value and that value will show up live in the whole system. Or, if we wish to build small scripts around gmetric, we can have a cron wrapper which does the scheduling.

The other end, the collection service where all the data coming from gmond is gathered, is the gmetad service. This also runs as a normal Linux service and has two major parts, a collection part and an aggregation part. It collects all the data, sums it up, builds various summary information, and then passes the data on to RRDtool to store in round-robin database files on disk. Again, gmetad picks up its configuration from /etc/ganglia/gmetad.conf. gmetad also has an XML query interface over which all the data that is collected can be accessed and queried.
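To give an idea of what such a custom plugin looks like, here is a rough sketch of a gmond Python module. The metric name, data source, and module path are invented for this example; the overall shape (metric_init returning descriptors, a callback the scheduler invokes, metric_cleanup on shutdown) follows the usual gmond Python module convention, but field details can vary across Ganglia versions, so treat it as a sketch rather than a reference.

    # Hypothetical gmond Python module sketch
    # (e.g. /usr/lib/ganglia/python_modules/app_queue.py).
    # gmond's scheduler calls the callback at the interval configured in the
    # matching .pyconf file; we only have to return the current value.

    def queue_depth_handler(name):
        # Replace with real logic: read a status file, hit a local URL, etc.
        try:
            with open("/var/run/myapp/queue_depth") as f:   # made-up source
                return int(f.read().strip())
        except IOError:
            return 0

    def metric_init(params):
        # Called once when gmond loads the module; returns metric descriptors.
        return [{
            "name": "app_queue_depth",        # invented metric name
            "call_back": queue_depth_handler,
            "time_max": 90,
            "value_type": "uint",
            "units": "items",
            "slope": "both",
            "format": "%u",
            "description": "Pending items in the application queue",
            "groups": "myapp",
        }]

    def metric_cleanup():
        # Called when gmond shuts down; nothing to release here.
        pass

    if __name__ == "__main__":
        # Quick standalone test outside gmond.
        for d in metric_init({}):
            print("%s = %s" % (d["name"], d["call_back"](d["name"])))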
So this is where we can make sense of our centrally collected data; we can apply whatever logic we want to make sense of it. The Ganglia web front-end is one possible option: it's the default front-end that ships with Ganglia. It gets its metadata from that XML query interface and gets the graphs directly from RRDtool — the graphs we saw earlier are rendered by RRDtool — and serves the result to the browser.

One other thing we can do with this data is connect Nagios to it. Over the system of all the metrics being collected across our server farm, Nagios can perform checks and raise alerts if something is going wrong, and it can send SMS and emails immediately when something is wrong. How frequently is data collected? That depends on the settings; it's all configurable. You can collect as frequently as every 15 seconds.

It's mostly scalable in all directions. Collection is very scalable: it's very light and uses non-TCP (UDP) transport. The data gathering and aggregation is also very scalable. The only thing which hits scalability is when the data is being written to disk. If the server farm scales to, say, tens of thousands of servers and tens of thousands of metrics, the disk becomes the bottleneck, because all those metrics are being written to disk concurrently. There are strategies to overcome that; other than that it scales very nicely, and it is used at Wikipedia on a cluster of something like 2,000 to 3,000 servers.

So, based on the data collected, if anything is wrong Nagios will send us an alert — Nagios handles the alerting. Is there an NRPE plugin? No. Nagios just runs in a silent mode; it does not connect to any NRPE. It simply queries Ganglia for all the data, and all the data is locally available on the same system — these are both running on the same machine. So Ganglia just collects data and Nagios is only used for the alerting; it needs no agents of its own for that. Nagios on its own doesn't scale as well, I think.

This is our current deployment setup. Our current servers can be divided into three major categories: our DB servers, which have a master-slave type of configuration; our app server cluster; and our services servers, which are basically servers hosting long-running Java services. On each of these servers a gmond is installed and configured to collect certain metrics. All gmonds collect a common set of metrics, but certain additional metrics are of more interest on certain servers. On the DB servers, gmond collects all MySQL-related and disk-I/O-related metrics. On the app servers, it collects all Apache-related metrics, or analyses the Apache logs and produces the corresponding numbers. And our Java services have JMX-exported beans which have all of their statistics directly available, so we can plug those directly into gmond, and all those metrics are available at the central place. This is the server where gmetad is installed. All these gmonds send data over UDP to this collector gmond; it collects all that data, and the gmetad running here pulls the data from it and writes it to disk. All of these are in the cloud; this one doesn't have to be — it can be anywhere.

Possible things that might go wrong? One is gmond itself — we are now running gmond on two or three cloud platforms.
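Since Nagios in this setup only queries locally available Ganglia data, a check can be as simple as reading gmetad's XML dump and comparing one value against a threshold. The sketch below is illustrative, not our actual check: the host name, metric, and threshold are invented, and port 8651 is the usual gmetad xml_port, so verify it against your gmetad.conf.

    # Hypothetical sketch of a Nagios-style check that reads a metric straight
    # from gmetad's XML interface on the same box (no NRPE, no remote polling).
    import socket
    import sys
    import xml.etree.ElementTree as ET

    GMETAD_HOST = "127.0.0.1"
    GMETAD_XML_PORT = 8651      # default gmetad xml_port; confirm in gmetad.conf

    def fetch_xml(host, port):
        # gmetad dumps the full XML tree and then closes the connection.
        sock = socket.create_connection((host, port))
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks)

    def metric_value(root, host_name, metric_name):
        for host in root.iter("HOST"):
            if host.get("NAME") == host_name:
                for metric in host.iter("METRIC"):
                    if metric.get("NAME") == metric_name:
                        return float(metric.get("VAL"))
        return None

    if __name__ == "__main__":
        root = ET.fromstring(fetch_xml(GMETAD_HOST, GMETAD_XML_PORT))
        load = metric_value(root, "db-master.example.com", "load_one")  # made-up host
        if load is None:
            print("UNKNOWN: metric not found")
            sys.exit(3)
        if load > 8.0:                                                  # made-up threshold
            print("CRITICAL: load_one=%.2f" % load)
            sys.exit(2)
        print("OK: load_one=%.2f" % load)
        sys.exit(0)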
First of all, gmond is in a very active stage of development, so sometimes memory-related bugs show up. Other than that, the collector gmond keeps the state of all the servers that report to it. Have you seen it taking too much time? No, I have not seen it taking too much time, but its memory usage does grow; in our setup it doesn't grow that much. It holds the current data of all the servers it knows about. Does it write that to disk? No, gmond does not store anything on disk; it keeps an in-memory hash of all the servers so the data can be accessed fast. gmetad is what pulls the data from this gmond: gmond holds it in memory, gmetad pulls it and writes it to disk.

That part is one of the major concerns, because all of the other parts are more or less scalable compared to it. When a large number of metrics get written on a normal hard-disk environment, it starts to hurt. What can be done to overcome that? One option is to use a RAM file system where all the RRD files are stored directly in RAM, so the disk wait goes away; but it has the disadvantage that you cannot persist the data. Or you might run a redundant setup, where one node persists data and the other does not. rrdcached has also evolved, which caches all the updates that RRDtool receives and periodically flushes them to disk.

Other than that, there is data retention and resolution. Ganglia does not retain all the data it collects at full resolution. By default, data more than a year old is lost. For the current hour it keeps all the data points; for the past day it keeps averages for each hour; similarly, for the past week it keeps averages for each day. So the resolution of the data changes with age. This is a configurable property again: if we want, we can configure the RRDs to store all the data, but that would be more space intensive.

Yes, the number of collectors is scalable. gmond sends the data out on UDP; ideally, and normally, it sends on UDP multicast, so it just sends the data out to all the nodes on the network and whichever node is listening can collect it. You might have two nodes monitoring this cluster, another cluster being monitored by two other nodes, and a gmetad which sits on top of those collector nodes. That's how it's designed.

Like I said, all that data available in a single place can be used in different ways. One of the first things required is to raise alerts when something goes wrong. A distributed system has a lot of interlinked components and modules here and there, and a lot of things can go wrong, so we need an idea of what is going wrong and when. We use Nagios for that. Nagios has a very good alert system. Our Nagios setup uses local data only, and we have made some custom changes to Nagios as well. We can set thresholds on metrics across our server farm, and we can set thresholds on expressions of metrics. Say the metric is the one-minute load average and I want to set a threshold of one per CPU — a load of one on a one-CPU machine, or two on a two-CPU machine. What I can do is directly write an expression, load divided by CPU count — both of these are available metrics — and set a threshold on that. Even such complex expressions can be formed over all the metrics available across our servers, and such conditions are used for a lot of our services.
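To illustrate those expression thresholds, here is a small, self-contained sketch. This is not the actual custom Nagios change; the metric values, expression, and threshold are made up for the example — the real setup reads the values from the locally available Ganglia data.

    # Hypothetical sketch of evaluating a threshold on an expression of metrics,
    # e.g. "load_one / cpu_num", as described above.  The snapshot below is a
    # stand-in for values that would come from gmetad's XML interface.

    def evaluate(expression, metrics):
        # eval() restricted to the metric names only; the expression is our own config.
        return eval(expression, {"__builtins__": {}}, metrics)

    if __name__ == "__main__":
        snapshot = {"load_one": 6.4, "cpu_num": 4}       # made-up values
        expr = "load_one / cpu_num"                      # one load unit per CPU
        value = evaluate(expr, snapshot)
        state = "CRITICAL" if value > 1.5 else "OK"      # made-up threshold
        print("%s: %s = %.2f" % (state, expr, value))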
Nagios traditionally relies on a setup where you have to go and change a config file every time you make a change, and then reload — which Ganglia handles automatically, so for us that is only additional trouble. To deal with that, we move all the configuration we can out of Nagios. One of the basic things we do is name servers uniformly and write regular-expression rules in Nagios. So all DB servers have "db" in their name and can be targeted that way; if a new master comes up it has "master" in its name, and similarly if a DB slave comes up, the rules apply to it directly.

When Nagios runs on a large cluster and errors are thrown, in certain cases it floods you with a lot of errors. You then have to look into all of those errors and find out what the root cause was. Say MySQL was not running: it will show errors that MySQL is not running, that replication on MySQL is not running, and that slave lag on MySQL replication is critical. This is a simple case with only one MySQL master and one slave affected, but on a large server farm it will show thousands of errors, and finding the root cause will again be time-consuming, which is not necessary. Nagios has a concept of service dependencies which we can use directly on our data.

Here is an example of how this works. All of these checks are Nagios services which run on a host and do a service check; the result can be either pass or fail. The first check is whether the connector is working or not. If it is not working, none of the other errors should come out, because it is the most basic level of failure; so if it fails, it reports a failure and ends, and no other path is evaluated. If it passes, in this example it goes down two paths. First it goes to the master DB server and checks whether MySQL is running or not. We have written a custom collector so we can detect erroneous cases, whatever is not working. It has its own local file-backed cache, so if it is not updating in time — Ganglia also supplies, for every metric, the freshness of that metric — and the freshness is too far off, we know this thing is not working. So three or four conditions go into deciding whether the connector is working. If MySQL is not running on the DB master, it immediately fails, reports that failure, and reports no other failures. If that passes, it checks whether the number of threads connected reaches the threshold or not; if it reaches the threshold it again fails and reports the failure, otherwise it passes. The other path goes to the slave and checks whether MySQL is running. Only if MySQL is running does it check the MySQL replication status, and only if the replication status is okay does it check the threshold on the MySQL slave lag. So in all these cases, only the base error gets thrown when an error happens; whenever an error at one level occurs, the errors below it are not reported, so we get alerted on the root cause.

Just to give an idea of what our current servers were running a few days back: with some 2,000 metrics on some 20-odd servers, a normal gmond sits at around 10 MB of RAM, gmetad at around 4 MB of RAM, and Nagios at around 6 MB of RAM, which is quite reasonable at this scale, and we get alerts within 5 minutes of things going wrong. I'll now walk you through a small demo of how to set up Ganglia.
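The dependency idea can be pictured with a small sketch. This is not how Nagios implements service dependencies internally — Nagios has its own configuration objects for that — it just illustrates the short-circuit behaviour described above, with invented check names and simulated results.

    # Illustrative sketch of dependency short-circuiting: each check depends on
    # the one before it, and once a check fails the downstream checks are
    # suppressed so only the root cause is reported.

    def run_chain(checks):
        """checks is an ordered list of (name, check_fn); stop at the first failure."""
        for name, check in checks:
            if not check():
                return "CRITICAL: %s (downstream checks suppressed)" % name
        return "OK: all checks passed"

    if __name__ == "__main__":
        # Master path: connector fresh -> mysqld running -> threads under threshold.
        master_chain = [
            ("ganglia connector fresh", lambda: True),
            ("mysqld running on master", lambda: False),   # simulate a failure
            ("threads_connected below threshold", lambda: True),
        ]
        # Slave path: mysqld running -> replication running -> slave lag under threshold.
        slave_chain = [
            ("mysqld running on slave", lambda: True),
            ("replication running", lambda: True),
            ("slave lag below threshold", lambda: True),
        ]
        print(run_chain(master_chain))   # reports only the mysqld failure
        print(run_chain(slave_chain))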
I'll walk you through the demo now; in the meanwhile, if there are questions, I'll take them while I get my laptop set up. About the 5-minute response time — what drives it? We use active checks. We don't use NRPE and all those solutions to collect data; Nagios doesn't do any data collection, all the data comes from Ganglia. The 5 minutes and everything else is predictable and configurable — you can change any of the values to your liking — but if you update your metrics too frequently, you'll again hit the disk bottleneck. We have different periods for different types of metrics: some metrics take time or need computation to be collected, so we don't collect them as frequently as metrics that are readily available.

So we have these two demo machines set up. One is "demo" and one is "newdemo". Demo has a running instance of gmond and gmetad, and we can look at its Ganglia front-end — just a default running instance. These are the pre-loaded metrics, all directly visible. This is the overview graph of the current memory utilization; this is the current CPU utilization; and this is the current cluster utilization, though it's only a one-server cluster — here all the servers would be visible. All of these graphs can be zoomed into by dragging: when you release, it zooms to that time frame. There are various metrics being collected for demo. One plugin I have installed additionally is this plugin which checks the current state of Apache. Other than that, CPU metrics are available by default, these system metrics are installed by default, and disk metrics are available by default.

What I am doing now is installing gmond on a new server called newdemo and configuring it to send data to this box. I have these pre-made Debian packages which directly install and configure Ganglia. On our production systems we have simple scripts which take care of automating all this stuff (see the sketch at the end). This is the configuration file for gmond; all the configuration is picked up from it. I set the host name, and by default it is configured to send data on multicast which, like I said, won't work on most cloud platforms, so I have commented out multicast here and added a new send channel pointing to the host that I have gmetad on, which is demo. So the newdemo host, on which I just installed gmond, already shows up here, and soon its metrics will show up; only the metric metadata is loaded right now. The Apache plugin has this Python file and a config file: the config file needs to go into the conf folder and the Python file into the module folder, and after that the plugin is directly available, just as it is on demo. These basic metrics have started showing values here, and the Apache metrics are now coming through; for the Apache metrics only the metadata has been loaded so far, so not all of them have values yet. I will just hit the newdemo Apache server with multiple requests so that some values show up. These values have started to appear and will show up here.
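As a rough idea of the automation scripts mentioned in the demo, here is a sketch that switches a freshly installed gmond to UDP unicast pointing at the collector. The collector host name and the service name ("ganglia-monitor", as in the Debian packaging) are assumptions; the appended stanza is a standard udp_send_channel block, and the default multicast channel would still be commented out separately.

    # Hypothetical automation sketch: point a freshly installed gmond at the
    # collector over UDP unicast and restart the service.
    import subprocess

    GMOND_CONF = "/etc/ganglia/gmond.conf"
    COLLECTOR = "demo"     # host running the collector gmond / gmetad

    SEND_CHANNEL = """
    udp_send_channel {
      host = %s
      port = 8649
      ttl = 1
    }
    """ % COLLECTOR

    def add_unicast_channel():
        # gmond.conf allows several udp_send_channel blocks at the top level,
        # so appending one at the end of the file is enough.
        with open(GMOND_CONF) as f:
            if "host = %s" % COLLECTOR in f.read():
                return  # already configured
        with open(GMOND_CONF, "a") as f:
            f.write(SEND_CHANNEL)

    if __name__ == "__main__":
        add_unicast_channel()
        subprocess.check_call(["service", "ganglia-monitor", "restart"])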