OK, so I think I'm going to get started. Hello, everyone. My name is Bernard Li. I'm one of the project administrators of the Ganglia project. Today I'm going to talk to you about Ganglia, which is monitoring software. The project has been around for 10 years, so I'm going to tell you what the software does and why you would want to use it to monitor your computers, specifically clusters, grids, or just web farms. Actually, before we begin, I wanted to take a quick poll. Who has heard of Ganglia? OK, and who has actually used it? OK, that seems like a fair amount, so maybe you already know quite a bit about it. Well, let me just briefly introduce myself. I'm Bernard, and I've been working on high-performance-computing-related open source software. I've worked on provisioning tools like SystemImager and OSCAR, and on the monitoring side I work on Ganglia. For the past few years that's what I've been doing, getting involved with a lot of open source software. So, what is Ganglia? The goal of the software is to gather system resource metrics in real time so that you can figure out what your hosts are doing. Whether you have one host or hundreds or thousands of computers, once you set them up and they're running, you want to find out what the system resource usage looks like and what each machine is doing. When you have a system like that, you want software that gives you a centralized view of what's going on. That's Ganglia's goal. The project was started around 1999 by Matt Massie at the University of California, Berkeley, as part of the Millennium project, a project involved with building clusters, where they wanted to find out what the clusters were doing, what the system load was, and things like that. Matt wrote this software, and for the past 10 years, basically, when you think of monitoring a cluster, you think of Ganglia.
It has somewhat become the de facto standard for monitoring system resources. It's a very lightweight process. When you monitor these systems, you don't want your monitoring daemon to take up a lot of resources, because that would be very wasteful; you want to do real work on your computer, and if your monitoring software gets in the way of that, it sort of defeats the purpose. So it's very lightweight in terms of CPU and memory usage. Basically, you have a monitoring daemon called gmond that you run on every node, and all the metrics collected on each host are aggregated on a separate server, which runs the gmetad daemon. These metrics are aggregated into round-robin database (RRD) files. RRD files hold time-sliced data, so they're good for storing metric data in a way that lets you go back in time and look at what your system has been doing in the past. Again, it's a very lightweight agent. It supports most Unix and Linux systems, and even Windows via Cygwin, so basically you can run it on anything. And it doesn't matter which of these OSes you run Ganglia on; they will all work with each other. So you can have a mixed, hybrid system, as in many large corporations where you run different OSes, and you can use one tool to monitor everything. (Keep pressing the wrong button.) OK, so it's under a BSD open source license. So, what am I going to talk about? Basically, what you can do with this software, what it looks like when you actually use it as a user, a bit about the architecture, and some advanced topics. By default, Ganglia collects forty or so metrics about your host, like CPU load, memory, network, and all that. So that's the set collected by default.
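To make the data flow concrete: each gmond instance can be polled over TCP and answers with an XML dump of every metric it knows about, which is what gmetad consumes. Here is a minimal sketch of parsing such a dump with Python's standard library. The XML below is a hand-written, simplified approximation of gmond's real output, and the cluster, host, and metric values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified approximation of the XML a gmond node
# returns when polled over TCP; real output carries more attributes.
SAMPLE = """
<GANGLIA_XML VERSION="3.1.0">
  <CLUSTER NAME="web-farm" OWNER="ops">
    <HOST NAME="web01.example.com" REPORTED="1250000000">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="mem_free" VAL="512000" TYPE="uint32" UNITS="KB"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""

def metrics_by_host(xml_text):
    """Flatten a cluster dump into {host_name: {metric_name: value}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")
        }
    return result

print(metrics_by_host(SAMPLE))
```

In a live installation you would read the XML from the gmond TCP port instead of a string, but the parsing is the same.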
But if you want to collect your own metrics, like how your Apache server is doing, or memcached, or basically anything you can somehow collect from your operating system, you can plug that information into Ganglia. I'll go into that in a little detail. Ganglia is very scalable, but we're going to talk about some issues you run into with thousands or tens of thousands of hosts. And cloud computing is quite a hot topic nowadays, so I'll give you some brief notes on what that environment is like if you want to use Ganglia to monitor it. Then Daniel Pocock will come up and give a user testimonial, and I'll end with how you can get started and get involved with the project. As for typical users: the project came out of high-performance computing. These are clusters of computers that basically have one goal, to crunch a lot of numbers and run a lot of parallel code. Ganglia came about and made it very easy to figure out what your cluster is doing. It has this hierarchy of grids and clusters, so you can see how one cluster is doing on its own, and then aggregate the data across all the clusters. In large enterprises, you have different kinds of servers, web servers, database servers; in a large corporation you have many computers doing different things. You can use Ganglia to check how these servers are performing, then go back in time, look at the history, and figure out what's going on. The uses are pretty similar. And in your IT environment, you have support issues, like why your system is not performing the way you think it should.
You can also use it to look at memory utilization, and when your servers reach a certain load, maybe it's time to buy new computers, or upgrade your memory, or whatever. So you can use Ganglia to look at all these pretty graphs and get an idea of how your systems are performing. And then you can see, OK, if a whole bunch of servers have really high load, maybe you can shift the load around, or maybe even virtualize. You can also use it to troubleshoot applications. You have different users running different code on your computers, and maybe you're trying to figure out what is causing a problem. First of all, you need to know that your systems are having high I/O load, but how would you tell? If you have a thousand computers, you're not going to log into each one, run top, and figure it out. With something like Ganglia, you have graphs with aggregated information, so you can see very quickly what your systems are doing, and that information helps you troubleshoot application problems and things like that. And when you're writing new software, sometimes you don't know how it performs or how many resources it uses, so again, Ganglia is useful for those kinds of workloads. Just to give you some examples of who uses Ganglia: these are names you can find on our website; there's a little bar on the side that tells you who uses Ganglia. I'd like to point out Flickr especially. I know their previous operations manager, who would always say he used Ganglia for capacity planning. You have these graphs that tell you, OK, well, we've sort of hit the resource wall.
Maybe it's time to buy new computers. So you would go to your supervisor or your manager and say, OK, well, this is the real load, and we need more computers to handle it. It's good for that. So let me give you a quick demo of what Ganglia looks like. This is the Berkeley grid. You see here you have a main grid; this is sort of the top level, and it aggregates all the metrics you see at the bottom here. This is one cluster, which you can click into. This red line tells you the max, the number of CPUs in this cluster, and the number of running processes is here; this gray stuff is the load. There's a whole bunch of different types of charts you can see: memory, network, and so on. Down here, these are individual hosts. Red here means a host has a fairly high load, and green means it's not that busy. Again, you can click in here and see what each individual host is doing. This is one host; it says it's been up since this time, and it gives you a lot of information. Basically, all these metrics are collected at the host level and aggregated up to the top. A collection of hosts is a cluster, and a collection of clusters is a grid. You can actually even have a grid of grids, so you can aggregate all the way up. These are stats for individual hosts; it's pretty self-explanatory. OK, so let's talk a bit about the architecture. Every node runs the gmond agent; that's what you run on individual hosts. It doesn't keep any historic data locally; the data just gets sent around. The metric data is transmitted, by default, using multicast.
In environments where multicast could be considered chatty, where you don't want to send too many packets, what you can do instead is use unicast UDP packets, which reduces the amount of network traffic. Then you have this gmetad server that aggregates all the data and stores it in RRD files; I think I mentioned that already. All this information is then presented by the web front end, which serves the web pages you saw. You install a web server like Apache or lighttpd or whatever, and it basically runs on the same server as your gmetad process; it's used to create the graphs and charts that you see. So let's just run through what it looks like. By default, Ganglia uses multicast because it's very easy to set up; setting it up is basically just starting the daemon, and the default configuration uses multicast. Every node transmits its own metrics to the multicast group, so you don't need to do anything special, and every node receives metrics from every other node. Essentially, you talk to one node and it knows the metric information of everything, every node in that multicast group. A node can also be polled on a specific port, and it will give you an XML output of what the metrics look like; that's what we use to send the information around. Then we have the gmetad server and the web server that aggregate all the data. In the multicast environment, the gmetad server polls any one of the nodes. For those of you who have used it, the example configuration sort of suggests that you have to add each host: there's a data_source line that points to a particular gmond host.
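For reference, that data_source line in gmetad.conf looks roughly like this. The cluster name, host names, and polling interval below are illustrative values, not defaults:

```
# gmetad.conf (fragment, illustrative values)
# data_source "cluster name" [polling interval in seconds] host1 [host2 ...]
data_source "web-farm" 15 web01.example.com web02.example.com
```

gmetad polls the first listed host that answers; any additional hosts on the line are fallbacks.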
The configuration sort of suggests that you have to put every host in the multicast group into that file, but that's not necessary; listing several is mainly for redundancy, so that if one host in your multicast group goes down, gmetad can go to the others. Basically, you just poll one host in the group and you get all the information for all the hosts. RRD files are created to store the metrics, and then from the web browser you can see the graphs and charts and see what your installation is doing. Now I'm going to talk about some advanced topics. As I mentioned, by default Ganglia collects all these standard metrics, but what if you have your own metric that you want to collect that's not part of the standard set? We have a command-line tool called gmetric that you can feed metrics to. Typically, you write a script or program that gets the data, say the temperature reading of your host (usually there's some command you can run for that), and you feed the value to gmetric, with a cron job running it every couple of minutes to keep feeding in the information. In newer versions of Ganglia, we wrote a module interface for gmond, so you don't need to use gmetric anymore. You can write C or Python code, and the modules have callbacks; you set a value for how often gmond should collect the metric, and you write small snippets of code to actually collect the data. I think the next slide shows how it works. In this case you don't need to worry about having a cron job; the gmond process is in charge of periodically gathering the information. So this is a pretty stripped-down example of what the module interface looks like.
The first definition basically generates a random number and feeds it into gmond. That first function does all your work: going back to the temperature example, you would write some code there to get the temperature reading, then you init your metric and feed the value in. The time_max here is just how long gmond waits before the value must be refreshed, and the value type here is just an integer. It's actually pretty straightforward, and on the wiki we have documentation on how to write these modules. Now, Ganglia is designed to monitor a lot of computers, and we noticed that when you have a thousand computers you start to have scalability issues. The problem is that by default you have around 30 metrics per host, and if you have a thousand computers, that's 30,000 metrics. Each of these metrics, when it's collected, has to be written out to RRD files, which means a lot of I/O happening. Previously, what we did was put the RRD files on tmpfs, which is basically just RAM, so it's really speedy. That sort of alleviated the problem, so things could continue functioning. But the problem is that if you put your RRD files on tmpfs, once the gmetad server reboots, they're all gone, so you need to sync them to disk to keep the historic data. In the new version of RRDtool there's a new daemon called rrdcached. What it does is hang on to the writes to these RRD files: it buffers a number of updates until a certain amount of time has passed, or until there are enough updates, and then writes them out all at once. Because it buffers the writes, it reduces the I/O load by quite a bit.
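For those who haven't seen the slide, here is a sketch of what such a stripped-down Python module for gmond can look like, mirroring the random-number example just described. It follows the general shape of the 3.1 Python module interface (a metric_init that returns a list of metric descriptors, a call_back function, and a metric_cleanup); the metric name and field values here are illustrative, so check the wiki documentation for the exact fields your version expects.

```python
import random

def rand_value(name):
    # Callback gmond invokes whenever it wants a fresh value
    # for the metric named `name`.
    return random.randint(0, 100)

def metric_init(params):
    # Called once when gmond loads the module; returns the list
    # of metric descriptors this module provides.
    return [{
        'name': 'rand_num',        # metric name as shown in the front end
        'call_back': rand_value,   # function gmond calls to read the value
        'time_max': 90,            # seconds before the value is stale
        'value_type': 'uint',      # an integer, as in the slide
        'units': 'N',
        'slope': 'both',
        'format': '%u',
        'description': 'Random number example metric',
    }]

def metric_cleanup():
    # Called once when gmond shuts down.
    pass
```

gmond itself drives the loop: it calls metric_init at startup, then invokes the callback on its own schedule, which is why no cron job is needed.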
If your gmetad is monitoring over 1000 hosts or so, this is something you can consider. It's better than the tmpfs approach because you don't need to sync the files yourself; it's just a better, more well-rounded solution. OK, I'm not going to try to give my own definition of cloud computing, because there are already so many, but I'll talk briefly about how we as the Ganglia project want to address it. For us, the cloud environment is dynamic. Ganglia was designed to monitor clusters and grids, which are pretty static: you provision them once, and hosts basically don't go away; you just keep adding more. So we need to figure out how to handle this dynamic nature. In terms of networking, there's no multicast support in these environments. By default, Ganglia uses multicast; you can obviously use unicast, which I mentioned, but if you can't use multicast, it changes how you set things up. And these cloud computers basically have WAN IP addresses, so when you configure things, for load-balancing purposes, you need some way of bootstrapping the configuration. Maybe when your cloud host boots up, it talks to a centralized server to figure out which hosts it should send its metric data to. These are some of the things we need to think about. Going back to the dynamic nature, you have this host_dmax setting, which basically tells Ganglia how long to hang on to a host. In a typical cluster environment, you actually do want to know when a host goes down. But in cloud environments you ramp up and ramp down pretty quickly, so do you really want to keep track of hosts that way?
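For context, host_dmax lives in the globals section of gmond.conf; the value below is illustrative:

```
/* gmond.conf (fragment, illustrative value) */
globals {
  host_dmax = 3600   /* drop hosts silent for an hour; 0 = keep forever */
}
```

With a nonzero host_dmax, a host that has been silent for that many seconds is simply dropped from the view rather than lingering as a dead entry.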
Basically, if you adjust host_dmax, Ganglia just forgets that the host is gone. OK, so I'm now going to hand off to Daniel, who's going to give a user testimonial.

OK, my name is Daniel Pocock. I'm working at a large bank in London, where I've been deploying Ganglia, and I've also been involved with the Ganglia open source project for virtually the whole time I've been working on this project for my employer. I'm currently the release manager of the project as well; I've been doing that for about 18 months. Using open source methods is a key part of the job. It's something that was discussed right at the beginning, at the interview stage. I indicated that this is the way I work, and they were quite keen to pursue that with me. So we're just going to talk a little bit about what we've done with Ganglia, the challenges we've faced, and also the aspects of working on an open source project in a corporate environment. Every big company has a different attitude to open source software. You've probably seen this: some companies talk openly about their involvement with open source and Linux, and other companies are very much wedded to Microsoft in a big way. There are obviously distinctions between the meanings of free software and open source software. In a business, and particularly in a bank, financial concerns are important, so we'll just talk about free as in no price tag. In the good old days, people didn't have to worry too much about the cost of software. They'd often buy things based on the support contracts, the size of the vendor, and various other factors. These days, people are looking at a wider range of options; I don't think I need to go into the reasons behind that. But open source software is being looked at a lot more seriously, and where open source software provides a credible alternative, people have to look at it.
On the other hand, using public email lists, IRC chat, and sharing code on the public internet create issues for many organisations: issues about how the company is portrayed on the internet, and about the sharing of intellectual property. These are all challenges for different people in the company, some of whom are not from a software development background, to put it mildly. Ganglia has provided a compelling reason to have those debates in the organisation where I'm working, and it has a relatively unique status; we'll look at some of the reasons for that. It's not highly controversial, because it's a monitoring tool. It's not the core business of the company; the core business is banking, not system monitoring. So it's not a big loss if we're collaborating on a system monitoring tool, and we can do that with the Ganglia project quite effectively because of the modular nature of the project. As Bernard mentioned before, with version 3.1 of Ganglia you can develop your own metrics as modules in C or Python, and you can feed metrics in with gmetric. So if we have a need to develop a metric that uses proprietary code, we can do that, and that code can be separated out using the module interface. And if we want to share parts of the common agent code, as long as that module interface is stable, we can separate those things very easily. Looking at the large enterprise environment, you've got a mix of different platforms, from different generations. You have some machines running recent versions of Red Hat, and other machines running, say, Windows NT4, which is quite an old system. If you look around an organisation that's large enough, you will find a little bit of everything; you'll find mainframes if you look around. The users have a whole range of different concerns.
They're particularly concerned about anything that might make their system less stable, that might steal resources from their application, or that might add complexity to managing their hosts. Fortunately, the Ganglia agent is lightweight, and it runs on many of the platforms in a big environment. The source code can be tweaked if necessary, because it's open source. So if we have a particular need, we can recompile it for a particular platform, and if we don't want to use a particular library, some of the libraries can be disabled. The PCRE support, which was added recently, is a purely optional feature, so we can disable that. As for some of the challenges we face using the Ganglia product in particular: it's heavily reliant on DNS. Once again, big organisations have a range of DNS problems. They're not connected directly to public internet DNS servers. If there have been a lot of mergers and other corporate activity, there may be several different DNS zones within the organisation, possibly separated by different firewalls; there may be overlapping IP space, and a whole range of things. Now, Ganglia relies on reverse DNS lookups, and it relies on the host names to generate file names for the graphs and to generate the URLs for looking at those graphs. So when you have a lot of DNS-related issues in your network, those will be reflected in how you manage Ganglia. It's also not clear how Ganglia is intended to perform with short polling intervals. While looking through the gmetad code recently, I found some cases where poll intervals are randomized by five seconds either way; but if your polling interval is, say, five seconds and it's randomized by five seconds, the interval could be reduced to zero or increased to ten. I found that wasn't very effective, so we decided to tweak some parts of the code to handle it, but there may be more attention needed there.
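The failure mode described here is fixed-width jitter: a plus-or-minus 5 second offset applied to a 5 second base interval can collapse the interval to zero. A common alternative, sketched below in Python, is to jitter by a fraction of the base interval, so the result can never reach zero. This is a generic technique for illustration, not the code gmetad actually ships.

```python
import random

def jittered_interval(base_seconds, fraction=0.25):
    """Return the base polling interval perturbed by up to
    +/- `fraction` of itself, so short intervals stay positive."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return base_seconds * (1 + random.uniform(-fraction, fraction))

# With a 5 s base and the default 25% jitter, results always
# fall between 3.75 s and 6.25 s, never zero.
print(jittered_interval(5))
```

Jitter itself is still useful here, since it stops every poller from firing at exactly the same instant; the point is only to scale it to the interval.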
You've seen the example before with the hosts grouped into clusters and grids, but when you install the Ganglia package on a host, how does it know which cluster to join? The current version of the agent is configured using a static text file, so you can include a text file in the package. You can also use a tool like Puppet, if you have a Unix platform and you have Puppet across your whole network; you can use that to join different hosts to different clusters. But in an organisation that has Windows, and that has some hosts that are quite old, deploying Puppet would significantly magnify the effort of installing Ganglia, because then you've got to install two products and not just one. So that's another challenge we're looking at in the Ganglia project. To run Ganglia on Windows, you currently need Cygwin. The good news is that, using Cygwin, it does work, and it's quite effective. The bad news is that there are issues with having multiple Cygwin applications on the same host. Once again, if you've got a lot of Windows hosts, and some of them have been around for a long time and some are quite new and they're all running different applications, you may not know whether some of them already have Cygwin, and when you put the Ganglia agent on there, you could break something else. So the Cygwin DLL is a challenge we need to deal with. With the project I've been working on, and participating in the open source community, we've been able to discuss many of these issues and find ways to manage them, and some of that work has been contributed back to the open source project. So I'll just bring Bernard back now to wrap things up, and then we can go into some questions or some further demonstrations.

OK, thanks Daniel. So, how can you get started if you want to try it out?
I guess the easiest way is to use the pre-packaged stuff: there are packages for Debian, Red Hat, Fedora, SUSE, and they've all been around for some time. Even on Solaris, Daniel recently did a lot of work; it actually works now, right? Through OpenCSW, so you can get it for Solaris as well. As I mentioned before, even though you have different distributions, you just install the packages and they should just work. The only issue is that sometimes you can't really mix the 3.1 version with 3.0, but those are just some subtle issues. So again: just install gmond on all the computers you want to monitor for metrics, dedicate one server to gmetad and a web server, and basically you're done. In theory, you don't even need a configuration file for gmond, but you can use one if you want. If you want the bleeding-edge stuff, or somehow there are no prepackaged builds for your platform, you can download the source tarball, or build from our repository. The website is ganglia.info, where you can see what we've been doing, and there's a wiki and a SourceForge page. We've provided this framework for you to monitor systems, plus these custom ways of feeding in metrics, and there's actually a community around it writing their own metrics, so you don't have to reinvent the wheel. As I mentioned before, people have written modules to monitor Apache, or memcached; there's a whole bunch of custom metrics that have been created already, so check those out before you write your own. Finally, there are a couple of mailing lists, all hosted on SourceForge: ganglia-general, and ganglia-developers, which is the developer mailing list. We're also on IRC on Freenode, and there's a Twitter aggregation feed. Actually, one group of people I'm particularly interested in inviting to join the project: I guess you've seen our front end.
It's been like that for probably the past five to ten years. It's pretty functional, but as the project goes on, what would be nice is a way to customize what you see on the front end. There's already work being done to make the front end more modular so that you can customize it, because depending on your company's organization, some groups of people will want to see the Ganglia graphs one way and other groups may want to see them in different ways. It would be nice to provide some mechanism that makes it very easy to customize without writing any code. So if there are any Ajax or JavaScript gurus who are interested in working on a front-end project, let us know. With that, I'd like to thank Daniel for helping me prepare the slides (he did all the work), and thank the organizers for inviting us to give this talk. Thank you. I think we have time if you have any questions.

[Audience] Maybe I should be looking at the mailing list, but we're using Ganglia with IPv4. Are there any issues moving to IPv6?

Sorry, say that again?

[Audience] We're using Ganglia with IPv4. Are there any issues in moving to IPv6? Have you seen installations using IPv6?

IPv6. Actually, that's a very good question. I'm not aware of any issues; I would assume it works as long as your network and your operating system support it. I'm not sure whether we need to make any modifications to the code. Actually, does anybody here know? Maybe we can take this offline, because I'm not aware of anybody needing this yet.

[Audience] I know v6 does at least somewhat work, because it broke the BSD port fairly badly when the port went in. So I know the support is in there.

So you're saying it works with Ganglia just by default?
Yeah, because I haven't tested it myself, and I haven't actually seen much traffic about it. But definitely try it out.