As I was just starting to say, Nathan Scott is going to talk to us about developments in PCP, Performance Co-Pilot. Thank you. Yes, so I'm Nathan Scott. I work at Red Hat in the performance tools group. We do performance analysis tools such as PCP, which I'll be talking about today, but also other things like Valgrind and SystemTap and one or two other tools that we look after. So our position in Red Hat is we work on the tools that are in RHEL, so we're sort of operating system and sometimes kernel people. PCP is probably the highest-level tool that we're involved with in our group. And I'm going to talk exclusively about PCP and things that have been changing in PCP in the last year or so, and perhaps a little bit further back, that might be of interest. I gave a similar talk yesterday where I gave a long overview, an introduction to PCP. I'm going to kind of skip through that today because I think a few people will have seen that already, and it can take quite a lot of time to explain PCP; it's quite a huge software project. So I'm going to give just a very quick overview of PCP and then focus on all the new stuff that might be of interest, and perhaps in the questions we could come back to anything that I've skimmed over at the start. So PCP is a toolkit. We're talking system-level analysis, so we're not talking about profiling kinds of tools. We're talking about tools that can analyze multiple systems at once, and tools that are aimed at looking at historical performance data as well as live performance data. And they're extensible; I'll talk about that. And fundamentally distributed, so we're talking about multiple computers at once being analyzed, and interactions between computers. This is the architecture. I'm going to skip through this relatively quickly because it's not that relevant to the talk. But basically we have a collector system, which is any system you want to analyze.
It runs this main daemon, PMCD, plus plugin components that provide performance metrics, which can be of anything, anything you can measure. You can put it into PCP and make it available. And then you have monitoring tools, which are client tools, which then report that data or record that data or do something with that data: create charts, analyze the data, create summaries of it, whatever. But there's a very strong separation in PCP between the monitor and collector concepts. I'm going to skip over this one except to just say that the concept of a metric is core to PCP. Metrics are extremely well-defined. You know everything there is to know about a performance metric when you're working with PCP, and when you're adding new metrics into PCP, you need to define them very clearly. And this gives PCP that ability to separate the client side from the server side, or the collector side from the monitoring side, so well. So let's get on to the main gist of this talk, which is really talking about things that have changed in PCP in the last 6 to 12 months or so, important stuff. I'll start out with just general stuff. And probably the biggest change in the most recent time is that PCP is now included in RHEL and it's fully supported by Red Hat people such as myself and the other guys that I work with. It's supported in all of the RHEL 7 releases, and from RHEL 6.6 onwards it'll be supported. We're seeing a huge increase in activity in PCP. So this is a talk about performance, so I'll give you a graph down the bottom showing the sort of development work that's gone on in PCP over the last couple of years. For many years, PCP was bubbling away and developing slowly and nicely. Then Red Hat became very interested in it, so around just before the start of 2013, it was decided that we'd look into putting it into RHEL, and then you can see a massive ramp-up in the amount of lines of code.
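To make that concrete, here's a rough sketch, in plain Python rather than the real PCP API, of the kind of metadata every PCP metric carries; the field names here are illustrative (the real descriptor is the pmDesc structure in the C API):

```python
from dataclasses import dataclass

# Illustrative sketch only: the real metric descriptor is pmDesc in the
# PCP C API; these field names and value encodings are assumptions.
@dataclass
class MetricDesc:
    name: str        # e.g. "disk.dev.read"
    type: str        # data type: "u64", "double", "string", ...
    semantics: str   # "counter" (monotonic), "instant", or "discrete"
    units: str       # dimensioned units, e.g. "count", "Kbyte / second"
    indom: str       # instance domain, e.g. one value per disk device

reads = MetricDesc("disk.dev.read", "u64", "counter", "count", "per-disk")

# Because the semantics travel with the metric, a monitoring tool can
# decide how to present the value (e.g. rate-convert a counter) without
# knowing anything else about the collector that produced it.
assert reads.semantics == "counter"
```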
There's similar graphs for commit activity and contributors, and it's sort of snowballed from there. So what that's allowed us to do is, instead of just keeping the project bubbling along, we're now tackling some of the very difficult problems that the project has had for a long time, and long-standing feature requests that we've never really been able to get to, because we now have dedicated engineering people working on some of these problems. I'll talk about some of them a bit later if I get time. We do very regular stable releases, so maybe once a month, once every six weeks or so, we pull everyone's work together. We have very stringent testing regimes around everything that we do, and that lets us release relatively quickly and pull in new work. So there's constantly new people writing new collector pieces, for example, instrumentation for new things that are in the kernel or new pieces of software that exist in user space. They're being added into PCP on a regular basis and being released in stable releases. And also, again, with Red Hat's commitment, we've been able to improve the out-of-the-box experience of PCP. So when you go and install PCP and you switch it on, it now does a whole lot more work than it used to do in terms of configuring itself and starting up recording automatically, and it records a good base set of data to start you off with. So it gives you good coverage of all the things that you might be expecting, having used other tools like sar and things like that in the past. That used to take a lot of configuration and now it just magically works out of the box, which is great. So, yeah, in the last six months or so, and going back a bit beyond six months, we've seen the introduction of the HTTP JSON APIs that have been added to PCP. These let you access performance metrics that were previously available only through the C, C++ and Python kinds of APIs.
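As a sketch of what that JSON access looks like from a client's point of view: the response below approximates what the PCP web daemon's fetch endpoint returns, but the exact field names are an assumption here; check the pmwebapi documentation for the real schema.

```python
import json

# Hypothetical example of a JSON fetch response; the field names below
# are illustrative approximations of the pmwebapi _fetch result shape.
response = json.loads("""
{
  "timestamp": {"s": 1420070400, "us": 123456},
  "values": [
    {"name": "kernel.all.load",
     "instances": [{"instance": 1, "value": 0.42},
                   {"instance": 5, "value": 0.37}]}
  ]
}
""")

# A web client walks the metric values and their per-instance entries,
# e.g. the 1- and 5-minute load averages here, to drive a chart.
for metric in response["values"]:
    for inst in metric["instances"]:
        print(metric["name"], inst["instance"], inst["value"])
```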
There's now a REST API, and so we're now seeing sort of rich web clients starting to use PCP much more, and people have written Graphite and Grafana sorts of front-ends that sit in front of PCP and are able to graph PCP data very nicely in browsers, which has been a long-standing request. So that's one of those things that we've wanted for a long time, but most of us being sort of kernel people and very low-level people, we were less interested in doing that sort of stuff because we were using tools typically on the machine. So it's great to have got to that sort of level now. This was the topic of my talk yesterday: we've been doing work recently on monitoring containers with PCP, with the goal of not having to install anything inside individual containers, but being able to look inside containers that you have running, with PCP either installed on the bare-metal system running the containers or in a special privileged container, being able to sort of reach out and look inside another container, and improving a whole bunch of stuff for container monitoring. So the sort of stuff we had to do, and I don't think I'm going into a huge amount of detail here: because it's a distributed system, the client tools need to be able to say, potentially from another machine, okay, I'm interested in connecting to this remote machine, and I'm interested in the container named ABCDEF over there. That information is transferred across so that the server can look up the information relevant to that container, to be able to identify which cgroups are involved, and, if the metrics involved require changing namespaces, like network device statistics, for example, or mounted filesystem statistics, that server system is then able to switch into the namespace of that container and provide that data back.
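A tiny sketch of that server-side idea: given a container ID, work out which cgroup paths to consult. The path layout below is a hypothetical example of systemd-style Docker cgroup naming; the real implementation discovers this through the Docker daemon and the cgroup filesystem rather than hard-coding paths, and network or mount metrics additionally need the collector to enter the container's namespaces with setns(2), which is why this runs on the host (or in a privileged container), not in the client tool.

```python
# Hypothetical mapping from a Docker container ID to per-subsystem cgroup
# paths; illustrative only, real PCP resolves these dynamically.
def docker_cgroup_paths(container_id):
    pattern = "/sys/fs/cgroup/%s/system.slice/docker-%s.scope"
    return {subsys: pattern % (subsys, container_id)
            for subsys in ("cpuacct", "memory", "blkio")}

paths = docker_cgroup_paths("abcdef123456")
# CPU accounting for the container would then be read from files under
# paths["cpuacct"], memory usage from paths["memory"], and so on.
```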
So it's simplifying access to containers. It currently supports Docker only, but it's written so that we can support any other kind of container implementation, and it's basically just trying to make it easier to monitor containers from the outside looking in. So I'll talk a little bit about the new collector stuff. If you remember the design diagram from the start, PCP is kind of divided into monitoring systems and collecting systems, so the collector is the system that you're analyzing. And one of the strengths of PCP is the diversity of data that we have available, and that's just something that's constantly growing, particularly as PCP is getting more use even within Red Hat. We're getting more and more people in Red Hat's operations teams and Red Hat's customer support now sending in lots more additional sources of performance data that we can just plug in automatically, and lots of things from the community as well. So there's kind of two main directions, I suppose, that the community people tend to take. They're either using PCP for analyzing web stuff, or things that they're doing in the web space, and so we have additions to things like Apache, Elasticsearch, Memcached, and generally networking kinds of additions from those folks. Or we have a whole bunch of other people working in the high-performance computing space who are sending us patches related to the stuff that they're doing, running on very large iron, so they have all their own job control systems and distributed file systems that they're interested in, and they're sending a totally different set of collector pieces of code through to us. So it's great to have all these different people contributing now. That's just the flavor of some of the stuff recently, even just in the kernel: we have device mapper stuff, we have the XFS file system stuff, cluster stuff, compressed swap, and on and on. And like I was saying at the very start, it's designed to be extended, so if you have metrics
that are of interest to you, in your application that you're monitoring, you can plug that data into PCP as well, and that provides you with automatic access to all of the monitoring tools that are able to record that data for you and analyze it along with all the other system-level data. So, yeah, there's been ongoing additions to the database server collector pieces, so the MySQL and Postgres plugins have been bubbling away as well over the last six months. And another big addition was the Python PMDA support. PMDA stands for Performance Metrics Domain Agent, so that's the plugin collector piece. Previously, for a long time, you could only write those in C or C++ or in Perl, and more recently we've added Python to that set, and we're seeing a big influx of Python PMDAs at the moment. On the monitoring side, which is the tools that report the data being made available by the collector side, there's lots of work going on there as well. Python APIs exist there as well now, so on both sides of the fence you can write tools in Python: reporting tools that consume the data being produced on the remote system, on the collector system. And we've started seeing people implement some of the standard tools that you might expect from a general Linux install, like iostat, free, numastat. And the advantage they have in doing that is they can now take those same tools, with the output that they're used to, or that their scripts are used to collecting and expecting in a certain output format, and run those tools on historical data. So you can say, well, show me the iostat output from 2am yesterday morning and compare that to what's happening right now in a production system. There's new web tools, like I mentioned before. And there's been lots of additions to the graphical tools, so there's some Qt-based charting tools that ship with PCP, in the PCP GUI package that we've had for many, many years, and they're really a staple part of PCP, and we've been doing lots of work to improve the usability of those
tools. People have complained in the past, and there's plenty of work still to be done, but that's ongoing and good gains have been made. I mentioned before there is a setup so it automatically starts recording when you install and switch on PCP now, which it never used to do, and lots of feedback has been taken from the Red Hat customer support people saying this is the data that we need, that the customers are requesting, that is available from their systems, and that's all been fed back into the releases to make sure we have good coverage. And the last thing I want to talk about is the progress that's been made in terms of taking data from other systems. So PCP isn't just all about itself; we want to be able to incorporate data, and not just live data, which is what the collector side of PCP is all about, where you inject new metrics and new values as things are happening on the system. You can also take historical data, like sar data that you've got squirreled away, for example, or iostat data that you've captured from last week, or have been capturing for the last 10 months or years, and you can produce PCP archive format from that, which then lets you use all of the PCP monitoring tools with that historical data. That's about it, so there's a few resources that I want to point you towards, hang on a tick. There's loads and loads of information, and there are several books about PCP; they're also now all available online under an open licence, that is, on the website.

I was interested in the instrumentation of DNS, wondering how that works. Is it tied into a particular DNS server, or is it looking into the resolver libraries? What kind of information can I get from DNS using PCP? Just want that specific.

Yep, sure. That particular example I had in mind was from a contribution that was made about a month ago, where someone was exporting the information from, I think it's called, the unbound DNS server, and that exports a bunch of statistics already, so there's a command you can run
that produces a bunch of metrics, so those are now available. Any other questions? Go for it.

What's the level of granularity you have on time, and where does that come from, the client side or the server side?

Yep. The recorded data for almost everything is at sort of gettimeofday() level of granularity, which is microseconds. Timestamps, and timestamping, are done on the server side, as the sample is taken on the collector system. So the client will connect to the server system that it's monitoring and will keep that connection alive for the length of the reporting tool. Every time it asks for a sample of a set of performance metrics, the server system will take a timestamp at the time the sample is taken and send that back, and that's a Unix epoch timestamp that's sent back. The reporting tools also have the ability to switch time zones, so once that epoch timestamp comes back, you can choose later, when you're replaying the data, to replay it in whichever time zone you like; it's agnostic to the time zone. There's a few caveats to that. There's actually even finer granularity, nanosecond-level granularity, for some concepts within PCP. So PCP is huge, like I said at the start. It has this concept of event tracing in it, in recent incarnations, so you can actually feed event data back along with your sampled data stream, and that tends to require a very high level of granularity, and so the timestamps that are associated with events are at nanosecond resolution, or can be done at nanosecond resolution, as well. Any other questions? Guy?

Is there any way to integrate this into Nagios or some other monitoring tools, so you can have all the graphs in the same place?
Yes, yes, there's lots of people doing that. So the key piece of technology: once you've installed the collector, so you have all of your machines in your data centers, for example, ready to go, they'd have the collector system running and be able to export data using PCP. There's a program that comes with PCP called PMIE, which is the performance metrics inference engine, and that evaluates a set of performance rules that you give it. So you make expressions about performance situations that you're interested in, using performance metrics. It has its own little predicate calculus language where you can set up rules, so basically say if certain things happen, the values of metrics change in certain ways. And you can be connected to multiple hosts, so you can say: the load average goes up to this on this host, and the response time in the application goes down to this, and this other thing happens, and it's Tuesday, then you can send a message. You can run a program, run the NSCA program for example, and send events into Nagios. So yeah, there's plenty of people doing that.

So would those results end up in PNP for graphing? Because what you've described there sounds to me like an alerting type thing into Nagios, but Nagios also has PNP for graphs.

I don't know of anyone that's done that particular plugin, but there's now a REST API, so if you can plug in that way, you could start doing that now.

Thank you.

Is it possible to visualize a collection of metrics for a given environment, for instance in a cluster, and put them all together in a single place, to see, say for instance, the CPU load of a collection of servers at the same time?
Yes, yes, so there's several options there. The charting tools that come with PCP create sort of a vertically aligned strip chart, and you can plot individual plots from the different hosts. Or you can also take that data in and create what's called a derived metric, where you combine the values: you might want to sum them together and divide by some other metric value or something. There's capabilities built into the client tools that allow you to do that. What some people also do is create special collector system agents that sit, perhaps, at the head node of a cluster, take data in from all of the systems in the cluster, and then monitoring tools connect to the head node and it just passes the data out. So if you have a large amount of data that needs to be pulled in from each of the nodes, you can avoid sending all of that across the wire, out of your data center for example. Any more? You have to make it quick.

So the main purpose of this, you know, what's it like: is it intended basically to feed into things so that people can understand bugs and performance problems, or is it intended to be kind of feeding into the NOC, the network operations center kind of view of what's happening in the data center?

So the goal of PCP is to enable performance analysis. Feeding into NOCs and alerting systems is really just one aspect of that. Another critical aspect is the recording of data and keeping historical records of what's happened on machines, or multiple machines. But basically everything that's going on in PCP is focused on performance analysis. If there's anything that's needed to do performance analysis at a system level, that's the goal of PCP. No worries.

Okay, join me in thanking Nathan for his talk.