Well, I guess, welcome, everyone. I'm Mark Seger with HP Cloud Services. I'm a performance kind of guy; I do a lot of performance monitoring and visualization, and I'd like to talk to you for the next hour (45 minutes, whatever) about some of the things I've been doing in our own HP environment for Swift monitoring. I'd like to start by telling you what problem we're trying to solve, because I think that's often a good starting point. Then I'll talk a little about an open-source tool I developed a number of years ago called collectl, and then jump in and talk about what we're doing monitoring Swift and Glance environments, and also a little about what we're doing monitoring Nova environments.

That said, there's this interesting set of conflicting problem statements (goals, requirements, whatever you want to call them). At least I think they're conflicting, and it's also kind of a religious topic: the notion of centralized monitoring versus node-centric monitoring. The way I see it, centralized monitoring is certainly a great idea, but the problem is that if you have thousands of machines to monitor, you want to collect a lot of data, and you want to collect it frequently, then pick two, or maybe even just one, because you can't do all of it at the same time. So you have the centralized-monitoring folks who want to see what's going on across the entire environment, and at the same time you have other people who want as much data as they can get, as frequently as they can get it. They're two very different problem statements.
There's no real intersection that I know of. So in reality, what you wind up doing from a central-monitoring perspective is picking a subset of the data you want to monitor, picking a monitoring frequency that's not too fast, and collecting that subset. The dilemma is that if you're in the support organization, trying to deal with a system that crashed or is having performance issues, that's exactly when you want really fine-grained data. You want to see what's happening every five seconds, or every ten seconds, or every one second. But the centralized monitoring solution is only collecting data maybe every one or two minutes, and it may not be collecting the level of detail you need to solve your problem. This is what I think creates the conflicting problem statement. I would guess there really is no single solution where you can take one tool, put it somewhere, and do both. But I would claim the solution is obvious: if you have two separate types of requirements, you use two separate kinds of tools. You pick one that runs centrally and monitors a small set of data every minute or two, and you take a different tool, put it on all the individual nodes, and let it collect as much data as frequently as it wants.

That has problems of its own, though. First of all, will the data cross-correlate? Are the numbers you're collecting centrally going to match the data you're collecting locally? It's certainly possible that the two tools are actually measuring things differently, even though they claim to be measuring the same thing. For example, I've seen tools that report CPU load where some include the I/O wait counter as part of the load and others don't. So if your centralized monitoring tool includes it and your local monitoring tool doesn't, you go to correlate the data and it doesn't match. That's just one example. The other thing is customization: say you want to collect some custom data centrally, so you modify your central tool to collect it. But now your remote tools don't collect that data, and again you have a mismatch.

So at HP we've chosen a hybrid model. We have a centralized monitoring tool that collects a lot of data centrally, and a second tool that monitors data on the nodes, but the node-level tool has the ability to feed the central tool with a subset of its data periodically. As a result, when you want to add some custom data collection at the node level, that data correlates exactly with the data you're collecting centrally, because both are collecting the same data from the same source. That may make a little more sense in the next slide or two. The thing that's kind of interesting (and this is a total coincidence) is that our centralized monitoring tool is called collectd.
That's with a D at the end. Our other monitoring tool is called collectl, with an L at the end. There's absolutely no relationship between the two; they just happen to share a lot of letters.

So I want to talk for a couple of minutes about collectl, because that's what I do, and also give you an idea of the kind of information you can get when it comes to fine-grained data. It's been around for probably a dozen years or more. It's an open-source tool, and it has its roots in HPC, high-performance computing. From the very beginning, collectl was designed to work on multi-thousand-node clusters, collecting data at very low overhead: typically under a percent of a CPU, sometimes under a tenth of a percent, depending on how you have it configured. One of its features is that it coordinates its samples across the entire cluster to the microsecond, or as close to a microsecond as you can get from Linux's perspective. So when you're trying to track down some kind of cluster-wide event, if you look at the samples on one machine and see something happen at 11:03 and five seconds, you can go to another node and see exactly what happened at 11:03 and five seconds on that node as well. That's really pretty important when you're trying to figure out cross-machine events.

Another feature collectl has is that it can generate data you can immediately plot. Some tools collect data, and then you have to write a little script to analyze and reformat it if you want plots; collectl can do it automatically, and you'll see a little later where that's a really handy feature. It can also collect data and send it over a socket to a remote monitoring station at the same time, which is another feature that can help address this kind of problem. It also has a built-in API so you can add features for whatever you happen to be interested in that may not currently be native to it; we'll talk about that a little as well. And there are a few utilities you can use with collectl to extend its capability beyond a single node.

Real quick, because I don't want to spend a lot of time on this: collectl has a notion of summary data and detail data. The important thing about summary data is that for, say, CPU, it gives you a summary of the CPU load, meaning the average load across all your CPUs. You have to be careful here: if you're monitoring an eight-CPU node and it tells you you have a 12% load, you might be fooled into thinking you're only using 12% of your system, when in fact one of your CPUs may be at 100% and you're actually blocking on it. So that's something to be a little careful with. It will also report your aggregate network bandwidth; if you have two or three or four NICs, it adds up all the bandwidth and reports it as a single number. It does the same thing with disks: if you have multiple disks, it adds up all the disk I/O and reports it as a single number. On the surface you might think that's a dumb idea, because you want to look at individual disks, not just the totals.
But oftentimes it's a single device that's generating the load, so by looking at the totals, without even knowing which individual device you're interested in, you can get a relatively good idea of what's going on. Sometimes you want the individual numbers, sometimes you don't. The other thing this gives you, which I personally believe is pretty important, is that in that top display we're actually looking at CPU, disk, and network load on a single line. That means when you run this tool, you can literally look vertically down a column and see change, and to me that's one of the most important things about monitoring: you want to identify anomalies, abnormal behaviors, spikes, what have you. When you display data on multiple lines, where each line may represent a different type of device, it's really hard to spot change. That's the hallmark, I think, of collectl's brief mode: you get one line per sample. That's the key thing to keep in the back of your mind.

Verbose mode says: I can't fit this all on one line, so I'm not going to try; instead I'll take a certain type of data and use the entire line to display it. That's why in the top line, the CPU load really only tells you total system load and total user load, but when you look down at verbose mode, you see not only the user and system load but also how much time was spent in nice mode, how much time is being spent processing system interrupts, and a couple of other CPU parameters; similar things for disks, similar things for the network. So that's a real high-level example of some collectl data. Detail mode simply says: you want to see the individual CPUs or the individual disks? Here they are. It's a lot harder to read, and you may not want to look at it in this particular format, but if you need it, it's there.

All right, shifting gears: let's talk about Swift and Glance. Currently there's really no good way, from what I've seen, to get additional information about what's going on inside Swift or Glance. They do happen to write a transaction record into the API log every time someone does a GET or a PUT or a DELETE: a little record that tells you what time the operation occurred, the object size, and how many milliseconds the transfer took. So that's kind of a cool thing. Basically, what happened was I wrote a little script that runs as a daemon and tails one of the log files, the API log on the Swift node or the Glance node. The reason I keep talking about Swift and Glance together is that their log files are almost identical, which makes it easy to do the same thing for both of them instead of just one. So it tails the log file, and what it does is write rolling counters into a static file. Every second it says: here's how many GETs I've done. A second later: here's how many GETs I've done, and so on. It's not one of these things that says "here's how many GETs since the last time you looked"; it's a rolling counter, and this is the one thing that a lot of surprisingly smart people who do monitoring have not grasped. The whole notion of a rolling counter is that as many people as want to read the counter can read it, any time they want, and then read it again whenever they want, subtract the two values, and divide by the interval. It's really easy, and it guarantees that multiple people can read the counter at the same time. You never clear it.
It just keeps incrementing forever, and when it hits its maximum value it wraps back around to zero. So if you read one sample, read a second sample, and get a negative number, you add 2^32 and now you've got the difference. So this little utility reads the logs, counts the GETs, the PUTs, the DELETEs, etc., and writes them to a little file that looks like this. It's extremely ugly, and nobody would ever want to read it, but it's in perfect format for a program to read, parse, and do what you want with.

That gives you the ability to take this data and visualize it, and the tool I used to visualize it is collectl. I wrote a little plugin that reads this file, generates counts of GETs and PUTs at whatever interval you want, and gives you the ability to display them like I showed you on the previous slide, with brief mode, verbose mode, and that sort of thing. Then there's this other capability collectl has: it can send data to a remote machine, or even write it to another file, at a different frequency than it's collecting at. That's the key: at a different frequency. So what we have going on here is collectl running once a second, collecting this data, logging it, and doing everything it has to do, and then every minute it says: "Hey, collectd, here's what happened in the last minute." So you wind up with both scenarios.
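The rolling-counter arithmetic described above is simple enough to sketch in a few lines. This is an illustrative Python version, not the actual Perl plugin; it assumes a 32-bit counter that wraps at 2^32, as described in the talk.

```python
# Illustrative sketch of the rolling-counter idea: the counter
# increments forever and wraps at 2**32, and any reader computes a
# rate by subtracting two samples and dividing by the interval.
COUNTER_MAX = 2 ** 32

def rate(first, second, interval_secs):
    """Per-second rate between two reads of a rolling 32-bit counter."""
    delta = second - first
    if delta < 0:            # the counter wrapped past 2**32
        delta += COUNTER_MAX
    return delta / interval_secs

# Normal case: 600 - 100 = 500 operations over 5 seconds.
print(rate(100, 600, 5))                  # 100.0
# Wrapped case: the counter was near the top, then rolled over.
print(rate(COUNTER_MAX - 100, 400, 10))   # 50.0
```

Because the counter is never cleared, any number of readers can sample it independently at whatever interval suits them, which is exactly the property that lets one tool log locally every second while another reports centrally every minute.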
You're logging locally once a second, and you're logging remotely once a minute. The people managing the environment centrally can look at the high-level information, and if they see something weird happened at a certain time of day, they can go to the individual nodes and look at the fine-grained data to see what was going on every second.

So here's what you wind up seeing, and I don't know how well this displays toward the back of the room, but in the very top display we can see the CPU load, the network load, and the Glance load (or the Swift load): how many kilobytes of GETs and PUTs per second there were, how many operations there were, i.e., the total GETs, PUTs, DELETEs, etc., and whether any errors occurred. The next level down is the verbose display, which, as I said before, uses the whole line to show more; now we can see how many GETs, PUTs, DELETEs, POSTs, etc. there were individually. The display one up from the bottom is kind of cool, because it's the network-bandwidth one: it tells you, of your GETs, how many of those objects were transferred at between zero and ten megabytes a second, between ten and twenty megabytes a second, over thirty megabytes a second. You can get a feel for what kind of load Swift or Glance is putting on your network. And the bottom display shows the object sizes: how many objects were between zero and one megabyte, one megabyte and ten megabytes, ten megabytes and a hundred megabytes, a hundred megabytes and a gigabyte. So you can go back to some point in time and ask: if there was a problem right then, what kinds of transfers were in flight? Were these large objects or small objects, and how much bandwidth were they using?

Okay, a bit of a gear shift again: I want to talk about Nova. Same kind of scenario. I wrote a collectl plugin that runs on the Nova server, so it's not running in the VM; it's running on the server, and it looks at the VMs. The interesting thing is that if you look at the command line that started a VM, it tells you the Nova instance ID, which is kind of neat, and it also tells you the MAC address of the virtual NIC. Given the MAC address of the virtual NIC, you can look in some of the system tables and figure out its network name, like net_13 or 18 or what have you, and then you can go to collectl and ask: what's the bandwidth on this particular virtual network? In this little plugin I wrote, you can put it all together. So what we're looking at here is a machine with five virtual machines running on it, and for each virtual machine we can monitor the CPU load. These two columns labeled BCK stand for our own block-storage service that we allow VMs to mount, so you can monitor the block storage; you can also monitor the local disk activity and the network activity. You can tack the instance ID onto the end of each line, and then using something like nova-manage you can figure out which user it's associated with. It gives you a really nice view of what's going on across all the VMs on all your Nova servers. Again, this data in our case is being collected every five seconds, but we're also able to send it up to collectd so it can display the information once a minute. So again, you get the coarse grain at the operations center and the fine grain on the individual nodes, in case somebody needs to drill into the data some more.

That said, there are still a couple of missing pieces. Basically, I wrote some more software that says: I still want to visualize some things centrally, based on all the stuff I'm collecting on all these Glance nodes and Swift nodes. So let's use the cloud to monitor the cloud. What I really mean by that is there's a bunch of calculations you can do on the individual nodes, and then remotely, in parallel, copy that data up centrally; if you're only dealing with text, not databases and such, this can be pretty fast. The other piece I'm doing is rendering some of this collectl data into plots, in parallel, on each of the machines. When you pull it all back, you wind up with what I'll call a crude-but-useful interface, and again, I apologize if people in the back are having a hard time seeing this. But if you look at that upper-left-hand box, what we're seeing is a summary by day.
The labels didn't make it; I left them off when I made the slide. But we're basically looking at a week's worth of proxy operations, and you can see we're doing on the order of some 20 million operations a day. Everything in this upper-left-hand display is a hyperlink. The "e" in an entry like "167 e" was an experiment of mine: when I see errors, I put an "e" next to the number. Of course, I forgot the number-one rule of large environments, which is that everything is always in an error state, so it doesn't really provide a whole lot of useful information. But if you click on that link, here's what you see: the 16 proxy servers that contributed to that number (I guess it was 10 million operations that day), and for each machine you can see its name, how many operations it did, how many GETs there were, how many PUTs, how many DELETEs. Again, you can visually start looking for anomalies. For example, according to this, there were 6 million GETs on proxy 3, and none of the other proxies did that many. I don't know why; it just happened that particular day. One of the other things you can do with this display is that each of those node names is a hyperlink. Actually, I don't think... not yet; we'll get to that one.

But backing up: this display is actually a lot wider than would fit on the screen, so I've broken it into two. What's to the right of the display you're looking at is a further breakdown of how many 401 errors there were, how many 402 errors, all the different error codes and their counts, as well as the GET-and-PUT network bandwidth and the object-size counts. So the get-1 column tells me how many objects were between ten megabytes and a hundred megabytes, and get-2 how many were between a hundred megabytes and a gigabyte, etc. If you click on one of those error codes, you can actually go into the error log and see the exact error messages that generated them. And back to those node names at the top that I said were hyperlinks: if you click on one of them, you get a display that breaks out the GETs, PUTs, and DELETEs by hour. Now we can start drilling down a little deeper. For example, looking at this display you can see that at five o'clock there were 404 POSTs, and in all the other hours the counts were single digits, or barely more.
So again, all interesting data that might be worth exploring in more detail.

Okay, a little more of a gear shift. I mentioned some of the utilities that go with collectl, and one of them is a thing called colplot. What I was saying before is that every night, colplot goes and generates a whole bunch of plots from the day before and uploads them all to the central server. That means you can statically display these plots, and by statically displaying them, you can look at hundreds of plots at the same time, very quickly. If you have to render them when you want to display them, it gets a lot more expensive and a lot slower. So again, in this little interface I built, which is very crude, az1, az2, and az3 are our production clusters, and the rest of these, systest and R&D, are all our test clusters. You pick which type of data you want to look at, then say: I want to look at block-storage data, database data, Glance data, Swift data, etc. Then you click that little lower-left-hand button, and it says: okay, you wanted to look at this data, what kind of plots do you want? It has a whole bunch of links to a whole bunch of different plots. Since we've been talking about Glance and Swift operation counters, which in this case are the GETs, the PUTs, the DELETEs, etc., I highlighted those two as the kind of plots you could look at.

When you click on one of those buttons, up pop these plots, and they're very detailed. You really can't tell from the display, but each plot has 86,400 data points on it, because it's one a second for an entire day. I guess PNG is a very impressive technology, because each of these images is only 10 kilobytes. I'm still not sure how they fit all those pixels into that one little image, but I checked multiple times and they really are only 10 kilobytes, so these plots render very, very fast. For those in the back who can't see what's going on here: we've got a lot of different colors on the plots, where one color is the GETs, one is the PUTs, and one is the DELETEs. That's the top three plots. When you bring up the web page there are really like 16 or 20 plots on a single page, and you can very easily scroll up and down to compare what's going on. The plot on the bottom is showing my PUT rates for objects greater than a megabyte, greater than 10 megabytes, greater than a hundred megabytes. So if you can envision a plot showing this kind of data alongside CPU load or network traffic or disk load, you could then start saying things like:
"Here's this spike. I can see this spike right before 16:00, going straight up through the top, and whatever happened, someone did a PUT (not a GET, sorry) on a large object, and it probably crossed those three different servers, because I can see the three different spikes on three different servers."

Okay, another little gear shift. There's another tool that goes with this environment called colmux, which stands for "collectl multiplexer," and it has an interesting property: it lets you run collectl on multiple machines of your choosing, takes all the output from collectl, and displays it in a single display, much like top. The big difference is that instead of looking at top processes, you're looking at top anything. Anything collectl can report, you can display in this form and sort by any column of your choice. In this particular case, again, we're looking at Swift data, so we can look at the GETs, the PUTs, and the DELETEs across all the proxy servers, sorted by any particular column. That can help you identify proxies that may be running a little hotter than others, getting more of the load than they should, or proxies getting less of the load than they should. Or if this were a Swift object server, you could say: I want to look at the disks on these devices, and sort them by which disks have long wait times, or big I/O queues, or long service times. And you can look at them across dozens or even hundreds of nodes, all at the same time. By the way, this is part of the collectl open-source stuff; there's nothing cloud-specific here, except that in this case I'm showing how it works with some of these plugins I wrote for Swift and Glance.

There's a second form of colmux as well. The good thing about the top-like mode is that it lets you look at the top activity at any point, every second, when it refreshes. But the biggest problem with a tool like top, whether anybody's really noticed this or not, is that it shows you the top process this second, then the next second it shows you the top process again, and it's like: wait, was that the same process or a different one? Is the process changing every second, or is the same process always on top? You can't really tell, because there's no history with top; it's always instantaneous. Well, the second form of colmux says: okay, I know I can't display everything historically, but I want to pick one or two or three data items and display those, every second, on every machine. So this display at the bottom is looking at six machines and two elements, GETs and PUTs, and every second it displays a new sample. Now we're back to that whole business of spotting anomalies by scanning vertical columns: every row represents a complete sample across all the machines you're looking at. In this case we're only looking at six machines, so it's not particularly exciting.

A few years ago, back when I was doing high-performance computing, I was at a customer site with a 2000-node cluster, and they had four 40- or 50-inch monitors; it was mega-wide. When I saw that I said, "Wow, I've got to run colmux on this." So I ran colmux on it, monitoring, I think, 192 compute nodes. I want to show you a picture of it, and I want to caution you before I bring up the next screen: you're not going to be able to read it. That's okay; I can't read it either. But what you can do, I hope, is shape recognition, because it makes it really obvious what's going on. If you look at this display right here, without even being able to read anything, you can see the nodes on the left aren't doing anything: they're all zeros. On the nodes over on the right, where it gets a little lighter, you can see a burst of a lot of CPU traffic. You can see this solid band where it's very busy; I don't know what the numbers are, but you can see they're all pretty busy. And then on the right-hand side of the middle monitor, it's like: wow, things are really erratic here; I don't know what the hell is going on. This is some of the kind of thing you can do with colmux. Displaying things as a single line will sometimes help you see properties you couldn't see before.

So again, this kind of takes it all back to the notion of central monitoring versus local monitoring. With centralized monitoring, you really don't care about this type of data that often, unless there's a real problem. But having this data available locally can provide a lot of insight into whatever it is you're trying to look at. So that's the dual, schizophrenic model we're using, at least with some of our data. Does anybody have any questions?

Audience: [inaudible]

No, no, I do text... I mean, numbers.

Audience: [inaudible]

No, that's from the NIC. The one thing that may or may not be obvious about node-level monitoring is that almost everything is available in /proc. If you've ever looked at /proc, there's a ton of data in there. So if you run something like ifconfig,
you'll see some high-level counters and other stuff, all of which comes out of /proc. And that's the same data collectl is reading. Okay, good, thanks.

Audience: [question about heat maps, inaudible]

Yeah, that's a good one. Heat maps are something I've only just scratched the surface of. One of my colleagues has done a lot of work taking collectl data and feeding it into heat maps, and you see some very interesting patterns arise that you didn't normally see. Yes, sir?

Audience: Is this on? I guess so. This would be a good point in time to point out that Swift has statsd metrics emission as well, so a lot of the similar data is available through that avenue.

And where does that data come from?

Audience: Assuming you configure the proxy servers to send it, the metrics are emitted as statsd UDP packets, so you're still responsible for running a statsd server. There are a number of ways to deploy that: you can go with a central server; we do one per node and then aggregate as a second step.

How often does it send out the data?

Audience: It's configurable. Sorry, the data is sent out in real time as UDP packets, and then you configure your statsd server to sweep through, aggregate, and flush. So it's the collect-aggregate-flush cycle, as opposed to rolling counters. Once the data gets into whatever statsd is upstreaming into, whether it's Graphite...

Oh, okay. Actually, there's also a collectl interface to Graphite, so you could have collectl send it to Graphite as well. I guess the only thought that comes to mind about the whole UDP thing is: if you're not listening, you're going to miss something.

Audience: Right, it does have that property. But anyway, it's in the code base.

Okay, cool.

Audience: Congratulations, it's wonderful work. I was wondering how much of what you're planning to contribute back to the community...

Excuse me?
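To tie the two threads of this exchange together, here is a minimal sketch of both ideas: pulling a cumulative counter out of /proc/net/dev (the same kind of source the node-level tools read) and emitting a value as a statsd gauge over UDP. The metric name, interface name, and statsd address are illustrative assumptions, not anything from Swift's actual configuration.

```python
import socket

def read_rx_bytes(iface, path="/proc/net/dev"):
    """Return the cumulative receive-byte counter for one interface.

    /proc/net/dev has two header lines, then one "name: counters..."
    line per interface; the first counter field is received bytes.
    """
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue                      # skip the header lines
            name, counters = line.split(":", 1)
            if name.strip() == iface:
                return int(counters.split()[0])
    raise ValueError(f"interface {iface!r} not found")

def emit_gauge(metric, value, host="127.0.0.1", port=8125):
    """Fire-and-forget statsd gauge: send '<name>:<value>|g' over UDP."""
    payload = f"{metric}:{value}|g".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, port))

# e.g. emit_gauge("node.eth0.rx_bytes", read_rx_bytes("eth0"))
```

Because the packet is plain UDP, nothing guarantees delivery, which is exactly the "if you're not listening, you're going to miss something" property discussed above.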
Audience: I was wondering how much of what you've created here you're willing to put back into the community.

Oh. If you were at one of my earlier talks: I'm looking into this, and I'm hoping not to have a lot of issues trying to get things forward. A lot of the stuff I showed you is already open source. The one challenge I might have with the community, and this is as good a time as any to raise it (I'll make the statement and then maybe hide under the table), is that it's all written in Perl. When I first started doing all this, that was the language my group used, so all of collectl and all these plugins are based in Perl, and unfortunately collectl doesn't know how to talk to Python, so the plugins have to be written in Perl as well. But depending on how amenable the community is to non-Python stuff, it's not all that complicated, and it's certainly something I would like to look into making available to the community. So perhaps we can get there. All righty, thank you. Anybody else have anything? Okay, well, I guess I might be a couple of minutes early here. Yeah, shame on me.