If you don't know me, my name is Matt Cowher, one of the track leads for the experiments track. As Dr. Max said in all the other sessions in this room, if there's something you like about this track and want more of, let me know after. If there's something you don't like, let me know after. This session is particularly near and dear to my heart. I spent about my first five years as a sysadmin writing monitoring tools, which was a painful and thankless process, and I'm sure Kurt mostly agrees. It is thankless, but we get to thank him by making him talk about it for half an hour on stage. Cool? All right, thanks, Kurt. Very good. So what we're going to be talking about is CF Top, and if you don't know what that is, we'll show it and demo it along the way. A little bit about me: I'm a cloud architect with the ECS team, which is now part of CGI. I'm a cloud-native developer and wear a variety of hats, but for at least the next 30 minutes, the most important one is that I am the author of CF Top. So the first thing to understand is why I created it. We were engaged with a client in a production support capacity, looking at a Cloud Foundry installation running production applications and providing feedback about whether things were optimized appropriately and whether any of the applications on the platform were having issues. What I quickly discovered is that I was flying a bit blind. They had a syslog nozzle that was capturing log events from the Firehose, but they weren't capturing container or platform metrics at all. The only thing they were capturing was the application logs, standard out and standard error, and that was going to Splunk. The other thing they had installed was the JMX bridge.
So they were using AppDynamics to query the stats through JMX, and that provided some somewhat rudimentary dashboarding. You could see some information about what was going on in the platform, but it wasn't enough to answer basic questions such as: what application is taking the most traffic at this very moment? How much CPU is this application consuming at this very moment? Those were the things I wanted to figure out and know, so I looked around to see whether anything could provide that with low friction for installation and usage, which is to say I didn't want something you had to install on the platform itself. The reason is that I was dealing with a production platform, and installing yet another application in production can cause problems, because you need to go through change control requests and figure out whether this is an allowable application to deploy, and so forth. What I realized I wanted was something as simple as an application you're probably all familiar with in the UNIX world: the top command. If you're on Linux or HP-UX or AIX, you run top and you pretty much get what you would expect: CPU utilization, memory consumption, and a bunch of other stats. It's very simple, with a low barrier to entry; it's usually already there, and if not, it's easy to install. That's what I was looking for, and since I didn't find it, that's why I wrote CF Top. What it is is a plug-in to the CF command line. If you're not familiar with plug-ins, cf is of course the main command-line tool you use for things like cf push, so all application developers would be familiar with it, and admins and platform operators would be familiar with it as well.
What Pivotal, or Cloud Foundry, has done is provide the ability to extend the set of commands you can run. You can do cf push, create orgs, create spaces, and scale your application; those are all built-in commands, and as a developer you can write additional commands that augment the cf command line with extra functionality. In this particular case I wrote a plug-in called top that extends it with a new command called top. The way you install it, if you're not familiar with plug-ins, is by going to plugins.cloudfoundry.org, a website that provides a list of all the plug-ins you can install. These are all things that augment the command line, and the one we're specifically talking about here is CF Top. Installation instructions are on all of these; literally it's just doing cf install-plugin and following the guide right there, and you're good to go. So what we're going to do here is demo what it actually does and talk through the various components, and I encourage you to ask questions along the way; I intend this to be interactive, otherwise I think we'll both get bored. In this particular case, if I type cf plugins, and I probably need to stop this in order for you to see what I'm doing, what you can see is all the plug-ins I have installed, and specifically the one we're looking at here, the top plug-in. So I'm going to run cf top. I'm already targeted and already logged in, and this is providing me information about this particular foundation. We're going to walk through the header, the components at the top, and then start walking through the information down below. The first item at the top left here is the events section.
The events section shows you the number of events that have been seen on the Traffic Controller, or Loggregator; the Firehose is what you're really seeing there. In this particular case we've got about 400 events a second going through the Firehose, and the way it knows that is that CF Top has initialized two nozzles against the Firehose to receive all of those events, report on them, and display information derived from them. The events counter is counting how many events have been seen since CF Top was started, so in this case, since we started it one minute and 20 seconds ago, we've seen 41,000 events on that Firehose. I'm going to actually start this again. The second thing I want to note is the warm-up period, a countdown of 60 seconds. CF Top is intended to be as friendly as possible to the foundation it's monitoring, so there's no polling going on to figure out, hey, what's going on in your platform. It is literally just passively listening to the events on the Firehose, aggregating a bunch of stats together, and displaying what's actually going on. In order to get an accurate picture of all the information, it takes about 60 seconds: container events, for example, are output once per container every 30 seconds, and the Diego cell events only come out once every 60 seconds, which is the reason we have a 60-second warm-up period. After those 60 seconds, everything you see here should be an accurate picture of what the environment looks like. Continuing on, the next section, after where we're targeted and who we're logged in as, is the stack that the reporting Diego cells belong to.
In this case we have three Diego cells on the cflinuxfs2 stack, which is the default, and this is aggregated information about all of those cells. We have CPU utilization, cumulative across all three cells, and what the maximum CPU is. The reason that number might look a little strange is: how do you get 1200%? It works very similarly to Linux: if you've ever run top on Linux with a multi-threaded application, you can actually see more than 100% CPU utilization. Similarly here, it determines how many CPUs are in each Diego cell and multiplies that out by how many Diego cells you have. In this particular case each Diego cell has four CPUs, so four times three gives us 1200% as the maximum if you maxed out all the Diego cells in this stack. We then have memory information: how much memory is used, how much is available on these Diego cells, and how much is reserved. Reserved means that when you push an application, it has a quota, and that quota decrements, if you will, how much is reserved within a Diego cell; in the header we're adding that all together and determining how much is reserved in total across the three Diego cells we have here. In this particular case, we have 129 applications deployed to these three Diego cells. That is not containers; containers is what the 134 is. We have 129 unique applications, as in the name you gave cf push. Some of those applications have been scaled to more than one instance, which is how we end up with more containers, which is typical. You would never have fewer containers than applications, which actually isn't quite true, because you could have applications pushed to Cloud Foundry that aren't started.
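Backing up to the header's maximum-CPU figure for a moment, the arithmetic is the same as Linux top with a multi-threaded process: each CPU contributes 100%, so the ceiling is cells times CPUs-per-cell times 100. A trivial sketch (the function name is mine, not CF Top's):

```go
package main

import "fmt"

// maxCPUPercent computes the header's CPU ceiling: every CPU in every
// cell of the stack can contribute up to 100%.
func maxCPUPercent(cells, cpusPerCell int) int {
	return cells * cpusPerCell * 100
}

func main() {
	// Three cflinuxfs2 cells with four CPUs each, as in the demo.
	fmt.Println(maxCPUPercent(3, 4)) // prints 1200
}
```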
So if you look at this and see 129 applications but only 120 containers, for example, that's actually legal, because maybe some of your applications are stopped, or started but crashing. That's how those numbers can play out. Disk is similar to memory: how much is used, the maximum amount of disk space you have available, and how much is reserved. The next stack: it lists out all of the stacks available to you. This foundation I'm looking at has a Windows cell in it as well, so that's listed here. It's also aware of isolation segments, if you're familiar with those, and if you have isolation segments they'll be listed here as well; it groups the Diego cells by stack within isolation segments, because you could have an isolation segment A that has multiple stacks in it. All of that information is broken down but aggregated by that particular type. I'll go over the pieces here on the alerts; hopefully, if you're monitoring your platform, you actually don't have any alerts or warnings. But in this particular case, I have an application that is not in the desired state. To help people understand what this really means: DCR is the desired container count, and RCR is the actual reporting container count, which is to say that I wanted some number and the number I've got does not match. In this particular case I have a crashing app in here, which we'll look at in a second. I wanted one instance of this particular application, but because it keeps crashing and is not able to start, I've got zero, so it's warning you that, in this case, one application is not running the way you want it to.
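The desired-state check described above boils down to a simple comparison per application: alert when fewer containers report in than were asked for, but don't alert just because some apps are stopped. A minimal sketch, with illustrative names:

```go
package main

import "fmt"

// appState is a toy model of the DCR/RCR check behind the red alert.
type appState struct {
	name      string
	desired   int // DCR: how many instances you asked for
	reporting int // RCR: how many containers are reporting in
}

// notInDesiredState returns the apps where reporting < desired.
// A stopped app (desired 0) never trips the alert, which is why
// "fewer containers than applications" is legal overall.
func notInDesiredState(apps []appState) []string {
	var bad []string
	for _, a := range apps {
		if a.reporting < a.desired {
			bad = append(bad, a.name)
		}
	}
	return bad
}

func main() {
	apps := []appState{
		{"test-app-001", 4, 4},    // scaled to four, four reporting
		{"misbehaving-app", 1, 0}, // wanted one, crash-looping, got zero
	}
	fmt.Println(notInDesiredState(apps)) // prints [misbehaving-app]
}
```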
And then the next thing is a warning, because this is now part of Cloud Foundry's self-healing: I have 33 containers that have crashed in the last 24 hours, three of them in the last hour. In this particular case, that application, as long as it's not also the one that's not in the desired state, is running, but it keeps crashing. So this may be an indication that something's going on. Actually, it is an indication something's going wrong, and you can look at the information we'll see in a second to figure out why an application might be crashing. So let me take a pause there before I get into the details of the application list down below. Do we have any questions at this point? It looks like it says admin at the top. If I'm not an admin on a foundation, can I still use the tool? Yes. There are a couple of parts to that question. One is that I am actually logged in as admin, so obviously I have full authority to do anything. But CF Top actually needs two permissions: one is the Cloud Controller admin and the other is the Firehose permission. You can assign those permissions to any user, so obviously you'd want to assign them to an appropriate user, because at that point they can see everything in the Firehose, which includes any application logging, and if people are logging sensitive information, that could be bad if you just gave this out to everybody. Those are the two permissions you need to run this in what I call privileged mode, the mode you're seeing right now.
You can actually run this without those permissions, in a non-privileged mode. If you're just a developer who isn't going to get those two permissions, it runs in a reduced-functionality mode, which is to say you only have the ability to see the applications, meaning the orgs, spaces, and applications within them, that you have permission to see. And because it can't initialize a Firehose connection to Cloud Foundry without that permission, it instead initializes an independent web socket for every application you can see. I put a cap on that at 50, so it will show you the first 50, and when I say first, I mean the oldest 50 applications deployed to the platform that you have authority to see. There's some other functionality, which I can talk about later, that you won't have either; for instance, the header obviously can't show you all of the aggregated cells, because as a regular developer you don't have access to that level of information. So it does work, it just works in a reduced-functionality mode as a developer. Any other questions at this point? All right, so let's go down to the section below, which is a list of all of the applications. Again, since I'm logged in as admin, this is all of the applications running on this platform, on the left-hand side here. This should be 129 applications, actually 130 because of the Windows one as well. What we're seeing is the application name and of course the org and space it's deployed to. DCR is the desired container count; this is how much it's scaled to. In other words, if you've scaled your application to five instances, DCR would be five, meaning that's the desired count, how many you want. RCR is how many are actually reporting in. Hopefully those two numbers match.
If they don't match, that's generally why you would get the red alert banner: it's saying that something is not working. It couldn't start, or it crashed, and it's currently in that state, meaning it hasn't recovered that container yet. The next piece is the default sort when you bring up CF Top, which is CPU utilization. Right now we're doing a descending sort on CPU percent, and this is the percentage of CPU consumed by all instances of that application. So let's actually look at this. My very top application here is test app 001. We've got a DCR of four, meaning we have four instances currently scaled, and we actually have four containers reporting in. Aggregated together, all four of those are consuming 1.7% of the CPU of the Diego cell or cells within the stack, and isolation segment if you had one, where it's deployed; in this particular case, the default Linux stack. So it's actually the top one, and part of the 9.3% we see in the header is this 1.84% that this particular application is consuming. Continuing to the right, CRH is the crash count. Remember that somewhere in this list we've got 34 crashed containers in the last 24 hours, but we can see that this particular application has not crashed, at least not in the last 24 hours. The next one is memory used: this is how much memory is used in that container. Whatever it's running, Java or Node or whatever, that's just how much memory it has consumed; not your quota, obviously. Same thing with disk: how much disk is used. I'm using my right arrow here to go through additional fields that are available. The next one is the response time of this particular application.
The reason some of these show dash-dash is that the number is the response time for that application, for at least one of its containers, in the last 60 seconds; dash-dash means it's had no traffic in the last 60 seconds. If I keep going to the right here for a second: hopping all the way over, REQ/1 is requests per one second. In this particular case I'm throwing a load test at this at 50 requests a second. I've got another command prompt hitting this application with a GET request 50 times a second, just so I can put some traffic on here. Those 50 requests a second are each averaging about 2.3 milliseconds per call. So this can be a useful tool if you're getting reports that, hey, I'm getting really slow response times; it's a nice thing to fire up and take a look. Now, in this particular case, you might want to get a baseline first, to figure out what normal is. In other words, if you have an application that normally has a 15 second response time, is that good or bad? Well, I don't know; it depends on whether that's your normal baseline. Which brings up something I want to say real quickly: CF Top does not replace historic monitoring tools. If you're going to Splunk, or you're using Grafana or a number of other monitoring tools, all of those have history, meaning they're actually capturing all of the information from the platform and you can go back in time and look at what was going on. CF Top is not intended to do that; there is no history here. If I exit CF Top and bring it back up, all of these stats go back to zero, and everything is from the time you start CF Top, with the exception of the crash count. That's the only one that has history, if you will, over the last 24 hours. Just going through this quickly here.
/10 is the last 10 seconds: how many requests have occurred in the last 10 seconds, and /60 is how many have occurred in the last 60 seconds. This is the total request count, again since CF Top started, so since we've been running CF Top for 14 minutes now, we've gotten 42,000 calls on this particular application. This then breaks it down by HTTP response code. In this particular case they've all been successful, 2XX, meaning everything has returned 200 or somewhere in the 200 range. This again is a nice way to determine whether any of your applications are exhibiting a problem, because hopefully, if your applications are behaving well, you should not have any 500 errors. This is a way to see quickly if any of your applications are producing 500 errors or 404s or whatever. The isolation segment assigned to this: dash-dash means the default isolation segment, often called the shared isolation segment. And then going all the way to the right is the stack. Let me pause there for a second before we dive in any further; any questions? We've got a mic coming around. On the application list, what does the blue represent? The blue represents traffic that has occurred in the last 20 seconds, I believe; basically, recent traffic. And the important part here is that it is traffic that has gone through the Gorouter. The reason I say that is that if you happen to have container-to-container communication, meaning maybe you're using Eureka or something that talks directly container to container, that traffic is not visible from the Firehose perspective because it didn't go through the Gorouter. In that case it may look like you have an application that's taking CPU but isn't blue and isn't counting any requests, and you're like, well, what's it doing? It could very well be busy.
It just hasn't gone through the Gorouter, and therefore CF Top doesn't know about it. Do you have any kind of batch mode for this, like esxtop, where you could do a batch mode to capture more for analysis? I do not, not at this time. But feel free; this is all open source, by the way. If you go out to plugins.cloudfoundry.org, you can install it, and in the same section of that web page there is a link to the GitHub repository for CF Top, where you can log an issue for a feature request and explain your use case. So I encourage any of you, if you've used this before and it doesn't quite do what you want, feel free to add feature requests and we can talk about them. All right, the next thing we can do is delve deeper. Remember, I'm focusing on this test app 001, and there are four instances of this application, meaning there are four containers. If we drive into that one, we get a little more detail about that particular application. In this case we have the four containers, and remember I was saying there was about 1.8% of CPU; you can now see how that breaks down, how much CPU each container is taking. The other thing this shows you, which is possible to get through forensic analysis but sometimes challenging, is which Diego cell is actually hosting this. The reason you might want to know that is: what other things might be running on that Diego cell? We'll get into that in just a second. That's what the cell IP address on the right is: which Diego cell is running this particular application. And if you recall, we have three Diego cells in the Linux stack, which is why you end up seeing two instances hosted on .66: we have four instances but only three Diego cells.
So of course, one of the Diego cells has two. The information here is a self-explanatory breakdown of each container and how much memory and disk it's using. The only other thing to note, and we kind of skipped over it on the other page, is standard out and standard error. These are how many logging events have occurred from that container. If I go back a screen, that same information is on this screen in aggregated form: standard out, standard error, just aggregated. Where this is useful, from a use-case perspective, is if you are dropping a lot of log messages and you don't know what's going on. It may be that an application went to an environment, prod in the case we were dealing with, that had debug, or worse yet, trace level turned on, usually by accident. Somebody checked in a log4j or some other configuration file accidentally, and all of a sudden production is logging lots and lots of output. Obviously, if you're sending that stuff to Splunk and somebody's paying attention, they could do a query and say, wow, there's a lot of logging going on in this one particular application. But somebody would either have to have a query running all the time on some kind of schedule, or they'd have to be paying attention. Here, it's really easy to identify the offending application and say, oops, somebody messed up. Or maybe they didn't mess up, maybe you're actually doing a load test. But then again, maybe it's a load test environment where somebody just forgot to turn off debug, and that's affecting the results of your load test. So that may be an indicator that you want to do something. Let's see, what else have we got here? A lot of these things are self-explanatory. We've got the HTTP rate, which is the same as the prior screen: how many requests are occurring per second, per 10 seconds, per 60 seconds, and the response time averages over those same intervals.
So 2.8 milliseconds. On the right side is how many crashes have occurred in those time intervals, and at the bottom are the particular containers. The next thing we can do is drive a little deeper, and one option is getting general information about this application. Sometimes it's useful to get the application ID, its GUID; its org and space you would already know. In this case we're in the shared isolation segment. It shows what buildpack it was built with; this one was deployed with the static buildpack. It shows when the deployment actually occurred, and its reserved quotas are at the bottom here. If this were a Docker image, the buildpack would be replaced with Docker information, such as where the Docker image came from. The next option is to view the crash count. Since this one hasn't crashed, there's nothing interesting to look at. The next one is HTTP response codes. Again, this one's not particularly interesting because all of my requests here are doing a GET. So let me pick a different application that might be mildly more interesting; maybe the Eureka one. This one is doing both a GET and a PUT. Fortunately, they're all getting 200 response codes. But if you're a developer in particular, trying to figure out what is going on, you can start driving down. The summary screen just said something like 5XX, so you know you're getting some kind of 500-ish return code, but you don't know specifically. Was it a 505? Or even a 400: are you getting any 404s or 403s? This screen tells you specifically what the response code was and how many times that response code has occurred.
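The difference between the summary screen's 2XX/4XX/5XX buckets and this detail screen's exact codes is just integer division on the status code. A small sketch of that bucketing (my own helper, not CF Top's code):

```go
package main

import "fmt"

// bucket collapses a specific HTTP status code into the summary-screen
// class (2XX, 4XX, 5XX, ...); the detail screen keeps the exact code.
func bucket(status int) string {
	return fmt.Sprintf("%dXX", status/100)
}

func main() {
	counts := map[string]int{}
	for _, code := range []int{200, 200, 404, 503, 200} {
		counts[bucket(code)]++
	}
	fmt.Println(counts["2XX"], counts["4XX"], counts["5XX"]) // prints 3 1 1
}
```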
In this case, we have 132 GET method calls with a response code of 200, when that last occurred, and the last response time, again in milliseconds. And then if we drive down to find, let's go this way, our crashed app, the one that keeps crashing, this one provides you with a little more information. Again, all I'm doing is going into the details of this application. At the very bottom of the screen, from a helpful-hinting perspective: those are not all of the commands. When we see X for exit, D for display, order, filter, those are not all of the things you can do; they're just quick tips for the most common things people do. If you type H on any of these screens, it gives you exhaustive detail about every single field: what it means, how it's derived. And below that are all of the shortcuts and commands you can type. So I'd encourage you, if you're going to use this, to go ahead and explore; there are lots of hidden nuggets in there. Going back to the crash count, this shows you when this application last crashed; the index is the container index number. In this particular case, I think this thing is only scaled to one, meaning it only has index zero, hence they're all zero. But you can see whether the problem is in a particular container or whether it's moving around. The other thing is the return code from the application. Now, this particular application is an app called misbehaving-app that I wrote to deliberately return 66, hence the 66 here. Realistically, when you get crashed apps, it's likely returning something else that you didn't write yourself. And in the help, if you type H, I provided a little information about what those return codes could mean.
The main ones, at least from a Java perspective, and for Node and other buildpacks they'll likely be different, but in the case of Java, are these three return codes: 137, 143, and 255. Those are the most common ones you would get in an out-of-memory case. So if you're seeing "exited with status code" one of those, chances are your application is running out of memory, meaning your quota is too small, or you've got a memory leak or something to that effect. I could probably go on and on, but I think my time is getting close to an end, as fast as that went. So I did want to leave a few minutes here if anybody has any questions, and if not, I can take it offline. What else can I see besides applications? Ah, good question. You're right, there's so much more that we didn't see. A couple of things. One, certainly as an operator, if you're dealing with constant issues with running low on quota in your orgs and spaces, you can look at aggregated stats from an org-and-space perspective. In this particular case we can see that we have four orgs, and in the top org here we are dangerously low on our memory quota, hence why it's red. It's red when it gets to 90%, yellow at 80% or higher, and not specifically colored below that. And again, you can drive down. This is an ECS org, so if I press Enter on it, it will show you the spaces, any particular space quota, and again, the colorization there. The other thing that is extremely helpful from a debugging perspective is the cell list. All of the information I'm showing you here is available without CF Top; CF Top's not doing anything special, it's just aggregating data that's already available to you in a nice, humanly consumable format.
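Going back to those exit codes for a second: 137 is 128 plus SIGKILL (often the kernel's OOM killer), 143 is 128 plus SIGTERM, and 255 is a generic fatal exit, which is why, per the talk, all three usually point at memory in a Java app. A sketch of that interpretation (the function and wording are mine, not the plugin's help text):

```go
package main

import "fmt"

// explainExitCode maps the Java-buildpack exit codes called out in the
// talk to a likely cause. 137 = 128+SIGKILL, 143 = 128+SIGTERM,
// 255 = generic fatal exit; all three commonly mean the memory quota
// is too small or the app is leaking.
func explainExitCode(code int) string {
	switch code {
	case 137, 143, 255:
		return "likely out of memory: quota too small or a leak"
	default:
		return "application-specific exit code"
	}
}

func main() {
	fmt.Println(explainExitCode(137))
	fmt.Println(explainExitCode(66)) // misbehaving-app's deliberate code
}
```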
So you can do the forensics and figure this out, as in this case figuring out how many containers are currently running on cell number 66, the one with IP address .66. But it's sure nice to look at it here and say, well, there are 37, and here's how many CPUs it has. Going back to the debugging use case: are any cells overloaded, and if so, which one? And of course the next question is: all right, I can see one of my cells is overloaded, meaning overloaded from a CPU perspective, say at 400%, since again with four CPUs you can go up to 400%. Well, okay, but what's the offending application, or, the more accurate way to say it, what's the offending container running on that cell that's causing that to happen? You can drill down: if I pick a particular cell and press Enter, this shows you all of the application containers running on that particular cell, again sorted by default on CPU utilization. So the very top of that list immediately shows you the offending application that's causing that cell to be highly consumed. So, say again. Okay, fair enough. Anybody got a one-minute question? Go ahead. Tenancy within different networking segments: can you actually see which Diego cells are members of which? Let me get that term for you; thank you. Can you tell which Diego cells are in which isolation segments from here? Because they mentioned in the initial version of isolation segments that they weren't able to do that, so it's good to get the IP address ranges, the IP addresses for your Diego cells, and go through that process a bit. Yes, you can. Thank you. So the answer is yes at the moment, because they have not segmented the Loggregator to actually segment the traffic. It's all in one, which they mentioned in that talk.
Because of that, the answer is yes. This is an example from a different lab, a different foundation, that has two isolation segments, Spoke A and Spoke B, and it's also got a Windows cell. That actually provides you stats across that whole piece, so yes, that information is available. And you can also see, when you're scrolling through the application list, which isolation segment each application is a member of. So, all right, thanks everybody.