Welcome to another edition of RCE. This is Brock Palen. You can find us online at RCE-cast.com. There's an RSS feed and you can follow our Twitter feeds on there, too. I again have Jeff Squyres from Cisco Systems and the Open MPI Project. Jeff, thanks again.

Hey, Brock. Good morning. Good morning for you, at least. Good morning for me. And it's a good night for our guest. He's literally on the other side of the world from us. We very much appreciate him coming on at what I think is about 10 o'clock at night for him and 8 o'clock in the morning for us. So, Brock, why don't you give us an introduction here?

So, our guest today is Ken McDonell, who is actually located in Melbourne, Australia. And Ken is our first retired person who is working on this project for the love of it. So, this is a first for us. Which I think is awesome. Ken, why don't you give us an introduction?

Thanks very much, Jeff and Brock. It's a pleasure to be here. The hour is not quite as uncivilised as some of the calls that I've had to take over the years. My name is Ken McDonell. I've been a computer scientist for about 42 years. My earliest programming exposure was in FORTRAN II as part of a numerical methods course at university. I guess that was an early precursor to the field that's become known as HPC. And it was probably the last piece of HPC-like code that I ever wrote. So, I'm not an expert in HPC, but I have helped people in the HPC space deliver effective solutions and manage large and complex systems running HPC codes over many years. My main interest in computer science has been in performance analysis. I've been doing that for about 40 of the 42 years that I've been a computer scientist. I've been using Unix, and more latterly Linux, on a daily basis for about 37 years. And Performance Co-Pilot, the particular project we're going to talk about today, is something that I've been engaged with for about 18 years.
So, given all of that history, you can understand why I am now retired. And I enjoy grandchildren, gardening, cinema, travel. Some recreational computing, which is mostly PCP development and a small amount of technology-based consulting. Prior to being retired, I worked for 11 years as an academic. And I worked in engineering and engineering management roles for Pyramid Technology, Silicon Graphics, and a Melbourne-based company, Aconex.

So, I need clarification on something. FORTRAN II, did that involve punch cards?

If you were lucky, you got to use punch cards. If you were not that lucky, you got to use paper tape.

That is fabulous. Wow, I feel so ecstatic right here.

Yes, the first programs I created were punched by my own fair hand on punched cards.

Okay, so you mentioned Performance Co-Pilot, which is the tool we're going to be speaking about today. So, can you give us a quick rundown on what Performance Co-Pilot is?

Sure. Performance Co-Pilot, or PCP, the acronym we've come to know it by, is a collection of libraries, services, and packaged applications that's really designed for monitoring and managing the performance of complex systems. The focus is on system-level performance, and from our view, that spans the entire stack of contributing components. So, from the hardware through the operating system to the providers, into the libraries and the layered services, and then out into application space, and then to distributed applications running on potentially multiple machines. Architecturally, there are a number of bits of PCP that are similar to other performance management frameworks and some that are different. It has a client-server architecture, a producer and consumer architecture, so some bits of the code produce performance data and other bits consume it and process it.
It has a plug-in architecture and libraries that allow new sources of performance data to be easily added into the framework, so there's no fixed collection of data. It depends on the platform and the applications that are running on it. There's a single API that can be used by the applications that are consuming the performance data to process both real-time and historical data. So, this is a really important differentiator: we can look at not only today's activity, but compare that with yesterday or the day before. And then the whole thing is a distributed architecture, so even though the concepts of cluster, grid and cloud computing didn't exist as computing paradigms when PCP started, the framework is already well positioned to support the full spectrum of platforms and architectures that are used for complex computation. And really, we were trying to automate the mundane parts of performance management and use the people to handle the exceptions. So, that's where the co-pilot piece of the name comes in. It borrows from the avionics concept that modern aircraft are very hard for humans to fly and they need assistance. The same is true of performance management in current systems.

So, this is an 18-year-old project, if I recall what you said right there in the beginning. Give us a little bit of the evolution. So, what were you doing back in paper tape days and punch card days that was critical for performance analysis?

In punch card days, we weren't doing very much. It was a little bit later, when we started to talk about file systems and information management systems in the days before databases, that we still had performance problems, even in the file system. But the PCP genesis really came quite a bit later, in 1992, when I was working at Pyramid Technology and we were grappling with a bunch of performance problems that were being observed in early symmetric multiprocessor Unix systems.
These were systems with relatively small CPU counts, 12 CPUs, but already the operating system issues were getting to be really quite complex in the areas of cache effectiveness, VM and paging behavior, buffer and metadata management in the file system, networking in all the components of the networking stack, scheduler oddness, context switch storms. All of these things were becoming much more complex, and the conventional tools up to that point, things like sar and vmstat, some of which still exist today and are older than PCP, just were not giving us the insight we needed. So the idea started at Pyramid, but there was no development at Pyramid. It wasn't until we moved to SGI, in 1993, that we began building PCP for the IRIX operating system, and the first product was released as an add-on SGI product in March 1995. Then there's a whole history of various pieces of porting of the code to different platforms for a whole variety of different reasons. Some were skunkworks projects, some were for single customers, some were part of SGI products that ran on more than just the IRIX operating system. So we grappled with portability quite early on, endian issues, a whole lot of things like that. Then SGI embraced open source, and the PCP components were first released as open source in September of 1999, and over the following 10 years there have been additional releases of more and more of the pieces, and almost all of the original proprietary code has now been open sourced by SGI. And I should take a moment just to thank SGI. There was a good strategic reason for doing it at the time, but SGI has continued to support the project and been very responsive to our requests to have additional components placed in the open source. So much of what we see today is really a result of SGI playing a good open source community member in that role. PCP has become included in major Linux distributions as a knock-on from that open source work.
It went into SUSE in 2003, Red Hat in 2007, Ubuntu in 2009, which was really a flow-on from an earlier inclusion in Debian. It's an active open source project now with regular releases on both a major and minor schedule.

So there's a lot of history behind this thing and it's done some different things. So if I'm going to use PCP, what does it look like? You mentioned it's a client-server setup. What would I look to when trying to get performance information for my application?

The first thing would be to decide which of the collections of data you really are interested in, because PCP is focused on system-level performance. The application performance is just one part of the puzzle, so there's a whole range of base data that's available. Hardware instrumentation like event counters and NUMA interconnect utilization, perhaps activity and status from external networking components like routers and switches. All of the core operating system stuff is there, aspects like CPU utilization, memory, disk, file system, interrupts. All of the networking activity, so everything that you can see, for instance, from netstat on a Unix or Linux system. We export the data for every single process, all of the resource utilization for each process, and then statistics from services. Some of these don't have an HPC component, things like a Java virtual machine, or a database, or a web server, or a file service like Samba, or an email service. Perhaps, well, maybe not in the HPC case, things like the virtual machine infrastructure, KVM or VMware. So there's all of that, if you like, platform-specific data. And then, in addition, if the application has been designed in such a way that performance data from the application is intended to be made available, it can be integrated into this whole framework. So you can see the application data alongside all of that other hybrid and very varied collection of information.

So you mentioned a whole variety of platforms there.
I have to ask, do you support Windows?

Yes.

So you mentioned a wide array of services and things like that. Has someone taken the time to integrate deeply with all the Windows services and equivalents? Because they're usually quite different from their POSIX equivalents.

That's absolutely correct. And yes, we do have the Windows data. One of the things that differentiates PCP from some other frameworks is that if you think of a set of performance data as being maintained within the boundaries of a pool, if you like (it might be an operating system, or it might be an application, or it might be a database), the PCP component that is responsible for exposing that data is also responsible for exposing all the metadata that describes that data. So that's things like its name, the type of the data, the data semantics, its format, whether they're counters or instantaneous values. All of that information is exported, and the clients are driven by that metadata. So a graph plotter, when connected to a Windows machine, will see the set of metrics that make sense in Windows-speak. But the same tool could be concurrently connected to a Solaris system and would see all of the Solaris metrics. Now, for the Unix-based, Linux-based machines, there's a lot of similarity in the metrics. CPU utilization has the same kind of semantics and name across all of the operating system platforms, but that's not necessarily the case and is not required to be so.

Where are these counters coming from? Are they provided by the OS, or do I have to install probes into my different applications? Do I need hardware drivers? Am I reading hardware counters? There's all sorts of information. What does PCP primarily look at?

The answer to all of the above is yes. PCP is completely agnostic with respect to where the data comes from.
So each of the sources of data has an associated collector plug-in, which is integrated into the collection framework, and which knows how to get the data out of just that source. So the Windows component knows how to make the necessary Windows calls to get all of the data that you can see through the normal Windows activity monitoring tools. All of that data is available, there are APIs to get it, and the plug-in calls those. On Solaris, the agent makes calls into the Solaris operating system to get the data, and on Linux it makes calls into the Linux operating system or uses the data exported through the various procfs and sysfs files that are externalised in Linux. So it depends entirely on how the application or the collection source chooses to externalise the data. It may be via a system call. It may be via library calls. In the case of MPI applications and user space applications, it's much more common for it to be exported via a shared memory segment. And so the application will manage a shared memory segment, usually an mmapped file, which contains the data, and then the PCP plug-in will attach to that shared memory segment and retrieve the data from there.

So is there a standard library that I can just put into my application so that it creates a shared memory segment, or do I have to write this myself?

Absolutely, there's a library. We started with the original MPI implementation, which was built within SGI and is one of the components that has not been open sourced, but it's not that tricky. It wouldn't be hard to do it again if somebody wanted to. We just haven't had anybody put their hand up and ask for it to this point in time. But perhaps I'll describe how that works, and then I can describe the generalisation of that which allows arbitrary applications to provide very low overhead export.
So the MPI implementation basically used the wrapper structure of the MPI libraries and redefined the outer layers of the wrapper library, like all the other MPI performance monitoring tools basically do. So it gets control at each MPI call. You get passed through an instrumented piece of code before you go into the core MPI library to do the work. The PCP code that lives in that wrapper layer exports the data into these mmapped files, and then another component reads those mmapped files. That worked so well that we generalised it and built an agent, a generic memory-mapped agent called the MMV agent, which uses self-describing mmapped files. So it basically pokes around, notices these files appear, opens them and goes, I think I understand what's in here, and suddenly makes the data available to the upstream consumers. Now, the creation of those mmapped files and the management of them, including things like atomic increment of a variable, set a variable to a value, create a new variable, those kinds of things that an application would want to do, are all abstracted away into a library. And so with a relatively small number of calls in the code to be instrumented, you can arrange for that data to be available to be used by PCP. It doesn't mean it has to be consumed by PCP, it just becomes available.

So from everything you've described so far, it sounds like this would work well even in a multi-threaded environment, right? There's nothing technologically that would prevent getting data from sensors from multiple threads in the same application. Is that correct?

It's correct if you take the view that the threads would have to manage these mmapped files in a safe way. The PCP libraries themselves are not thread safe. That was a very deliberate decision at the beginning. It was based on performance considerations. We wanted to be part of the performance solution, not part of the performance problem.
And so we were very, very careful about not bloating the functionality, being very lean, and single-threaded was more than satisfactory for what we needed to do. We're beginning to reassess and reconsider that in a forthcoming release. We may relax it a little bit, but the libraries themselves at the moment are not thread safe. However, if multi-threaded applications manage separate mmapped files, or manage discrete areas of a single mmapped file, then everything will work just fine. So your statement that there's nothing in PCP that would prevent you instrumenting a multi-threaded application is absolutely true. It would just require a little bit of care and alignment to ensure that the threads were not standing on one another. Now, for the mmapped architecture to work well, you really want to have no locks, no system calls, and very, very low overhead to update the statistics in the mmapped file. That really means that each thread is operating on a private data segment, which is cache-line friendly, and all the other kinds of considerations you'd make in building a multi-threaded application. It flows over into the performance instrumentation, but you really want that to be lock-free, with data structures that are separated for each thread. Once you've done that, the rest of it will just work.

So how hard is it to export my own counters? Hardware or... say I have some custom piece of hardware that exports something in /proc. How hard is it for me to export those?

I think the record was eight minutes from delivery of the source code to being able to graph the data.

Good. That's great.

Now, it is true that I'd done it before.

That doesn't matter. Benchmarks are allowed to be gamed like that.

It depends entirely on how complex the data is. One of the biggest users of PCP in a real-world production environment is a company called Aconex, who have a web-based software-as-a-service offering.
They have hundreds of machines in multiple data centers around the world, and they have a very large Java application which is instrumented with PCP, and they have gone to the level of instrumenting very fine-grained components of that application as a simple way of managing it in a very complex environment. At the other extreme, you have something that does crude instrumentation of an HPC code, which might simply measure iteration counts, for instance, on a computation that's supposed to converge, so you could see the rate of progress; or might expose the current data set size as a metric of application memory, as opposed to system memory or process memory; or, in a transaction processing environment, might simply count transactions. So the counters could be quite simple or they could be extremely complex. The PCP data collection model is a pull model, not a push model. So the fact that you have lots of instrumentation available doesn't mean you pay a huge overhead for it, because until somebody asks for it, it doesn't really cost very much. If you have this low-cost instrumentation within the address space of the application, then there are no system calls or other locks or synchronizing primitives associated with updating the counters or the performance statistics. The only overhead comes when you collect them, and if you can be selective about collecting them, then you can control the overhead.

Ah, so this goes to exactly what I was going to ask you: with these mmapped files, how do you notice when the application has updated something? And you're saying you don't. You're not polling this memory all the time. You might look at it once, when it's created, and say, oh, these are the counters that you've made available. And then when somebody asks for it, you can just go read it right then and there. You don't continually poll and check for changes. Am I understanding you correctly?

That's absolutely correct.
And part of the metadata that is exported along with the metrics allows a client to say, oh, if the value of this metric is 100 now, and I sleep for 10 seconds and read it and the value is 200 later on, then I know it has increased by 100 over 10 seconds, so I can calculate the rate in the client, not in the collection infrastructure. So by not doing rate calculations in the collection infrastructure, we avoid all of the every-time-I-wake-up, run-round, update-everything, refresh-your-view-of-the-data work on the collection side, and it becomes simply: you pull the data at whatever frequency makes sense for the performance monitoring or performance management issue you're dealing with at the moment. So different data gets pulled at different rates.

I see. So what are the primary sources of overhead then? There are obviously some resources consumed, for example, for these mmapped files, and maybe some network transfer. What else? What are the other main sources of overhead?

That's pretty much it. There is a collector daemon that sits on the system under test, or the system being monitored, that acts as the message router, if you like, and request router. The clients request collections of data at once, which can come from multiple of these pools. The coordinating daemon takes the requests, farms them out to the plugins that know how to instantiate each of the elements of the data, stitches together a single response, and sends it back to the requesting client. Our design goal was that we wanted the infrastructure running on the system being tested, that's the collection infrastructure and the plugin components, to consume no more than 1% of one CPU. That was a design goal right from the outset, and we've demonstrably achieved it across a very wide range of deployments for typical monitoring and management scenarios.
Obviously, if you ask for all of the process data for all of the processes every 10 milliseconds, that number becomes 100% of one CPU, but that's an unrealistic request pattern, because nobody can make any sense of that kind of data sampled at that frequency.

When using this with my own application, you said there was a library, so I have to relink my application to use it?

That's correct. The SGI MPI case is a very special case, because the MPI library is typically built with this wrapper layer around the real code in an internal core of the library. Any modern dynamic linker will allow you to redefine those outer symbols while still using the library's inner symbols. If that's the particular scenario you're using, you don't need to relink or recompile anything. Changing environment variables, or their moral equivalent, will allow the linker to pick up your version of the outer library with the instrumentation and use it in lieu of the symbols that are in the standard library. But that's a very special case. The more general case is that you would need to recompile the code to enable this instrumentation to happen. In the extreme case, the application has no instrumentation in it at all. It's not set up for this kind of monitoring, so you'd have to decide what it is you wanted to measure and where it was appropriate to update that data, and then add source code calls to the library to make that happen, and then conditionally compile it in or out depending on whether you wanted an instrumented version of the application or not.

Let me throw in one little interjection here, a little advertisement for MPI-3. What you're referring to in the MPI-1 and MPI-2 world is the PMPI layer, the wrapper layer where tools can interject themselves into the code stream, effectively.
In MPI-3, I'm actually part of a working group, the tools working group, where we're making that even better, because one of the big complaints with the PMPI system is that you can basically only have one tool interject itself. There have been a couple of workarounds to that in some projects, but one of the big goals of our working group is to allow multiple tools to get even more data than they get today. So they'll actually be able to get internal counters and values out of an MPI implementation rather than just, oh, there's an MPI send and I'll snarf whatever data I can out of the parameters. So we're actually going to be exposing internal performance counters and configuration values and things like that, done in such a way that multiple tools can profile a single MPI process. So just a little forward-looking tip there.

That would be an improvement over the situation in the earlier MPI versions, where indeed you only got one shot at the cookie in terms of a monitoring tool. You could have nothing, or brand A, or brand B monitoring, but those were your options.

So what about PAPI? Do you guys hook into PAPI to lower the overhead?

You're just full of interesting questions, aren't you? But actually the answer is no. The PAPI framework... so let me digress and do a little bit of history. The IRIX version of PCP provided access at both the system level and the process level to the CPU counters in what, at that time, was the processor that was being used to build a system, which was a MIPS R10000. But that work was all done before PAPI even existed. So we have some historical experience in doing that, and there's absolutely no technical reason why PAPI could not be used to provide an agent to export hardware event counters into the PCP framework. Nobody's decided to do it, but there's no impediment to doing it. But if you take a step back, PCP would usually...
The most common kind of deployment in compute-intensive applications would be that PCP would not be brought into the mix until after you'd done due diligence, as I might describe it, in terms of CPU and cache efficiency. So things like cache profiling using PAPI, or profiling with time or event-counter measures to characterise the code and establish the density of cache misses and the like. That would all kind of happen first, and then you'd have the application, which you'd scale up and deploy into production, and the performance would then suck. And so then the question would be, so what happened when we scaled it up, or went to production, or started using larger data sets or higher levels of parallelism, deployed on platforms with many more nodes? And that's the place where PCP would really come into play, because that's the place where you'd be looking for things like CPUs that were unexpectedly idle in what should be a highly parallel computation; delays, everybody's out of the pool while there's some bottleneck in the file system writing temporary files or some other disruptive element of the computation; just plain badness in the networking or interconnect areas, particularly if this was a clustered or array platform; memory thrashing; context switch storms; long service times for remote operations like MPI calls, so at that point you might be interested in MPI at that level rather than PAPI; sporadic disruptive behaviour, things like garbage collection or buffer flushing. So it's those kinds of events that PCP would be more useful for, because they'd be the things that would help explain why an application, which you'd already optimised in the analysis using things like PAPI and those low-level CPU efficiency tools, still wasn't performing satisfactorily in production or scaled-up environments. The reasons for that are much more likely not to be described by event counters, but by the more macroscopic, ecosystem kinds of events around the application.

So where I've used PCP, I've used it more for poking into what the hardware is doing, disk counters, NUMA hits and misses, things like that, where I've not redone my application. But we've talked a lot about linking PCP and exporting counters from my application. What is the most common use of PCP? What's really its strong point? Is it application focused or system-level focused?

That really depends who you ask. If you went to some of the clients that have used PCP, they would be interested in the platform-specific things, so that would be their primary interest. But there are other people; Aconex, that I mentioned earlier, is an example where their most important piece of code is not anything to do with the platform, it's their application. And so they're really interested in instrumenting their application and, importantly, seeing where there are correlations between degraded application performance and other events on the platform. So that correlation, being able to see side by side that response time or throughput or service time or task completion time blows out at the same time that some unusual system-level activity occurs, is really where PCP will be used. So it's a good fit in places where people have big servers, or a server farm with a varied deployment of products and servers and applications, where there might not be a single mission-critical application but a bunch of applications where it's hard to attribute individual contributions to resource utilization. It works well in cases where there's no single factor influencing performance. I mean, if you have an application which is CPU-bound and makes very few calls into the operating system, you don't need any of the operating system instrumentation to understand what's going on. It's all out in user space.
But if you have a mixture of user space activity, rendezvous and synchronizing primitives, along with operating system stuff, then it becomes a much more complex issue, and so you're more interested in being able to see a holistic view. If the system is really complex, the co-pilot analogy really is very compelling, and a lot of people are persuaded that using PCP to get rid of all of the noise and concentrate on the anomalous things is extremely helpful. And I guess the other thing is that if there is a situation where there are small gains in efficiency or utilization that translate into very large cost savings, then that's the place where this sort of investigation is useful. It ends up being a human-based activity. You can automate a lot of the mundane stuff, but the hard and insightful things require humans. Unless there's a big return on that, it becomes a very expensive exercise. So it tends to be used in places where there are big potential gains, either from a business point of view or from a capital-cost-of-equipment kind of budgetary consideration.

So what's going on with the current development of PCP? Are you extending core functionality, or is a lot of the work going into adding new plugins for new platforms and new environments and new services these days? Or what's happening there?

The core functionality hasn't changed very much. What's happening is that people are looking to extend that functionality in ways that we didn't first envisage. One of the examples we've done recently is that it turns out to be quite interesting to consider how you would support a set of things called derived metrics. These are metrics where you don't have the base data, but you have some formula which you'd like to be able to evaluate, and you'd like to be able to see the set of results from that formula presented as data.
So we've added some functionality which allows derived metrics to be inserted, kind of like an old-fashioned streams module, in the flow of data, so that data can be fetched, massaged, and additional values exported. We're doing some work at the moment with some folks at Red Hat and some other places where people are interested in event tracing. And so we're looking at ways we can merge the traditional performance data with event trace data, from projects like SystemTap, or the ETW stuff on Windows. There are a number of projects now where people are starting to be much more creative about the insertion of dynamic probes into systems, and the source of that data being viewed alongside the non-event-trace data is an interesting research problem that we're looking at. We add plugins on kind of a regular basis, because the plugins are self-contained and don't require API changes, so there's very low resistance to the inclusion of new plugins into the source code base. They really don't impact anybody else, and if somebody wants to use them, then there's really no impediment to them just being incorporated. So we have a steady trickle of people coming along with plugins to address their own peculiar itches, which often other people find to be quite useful.

So PCP kicks out a bunch of information. I notice you had a tool, a PCP graphing tool, that connects to the servers running on the different hosts you want to monitor. How do you go through and get information out of all these counters?

Right, okay. So just to pick up on one thing: that graph charting tool, exactly the same tool, can be connected to an archive of historical data, and it behaves exactly the same way. So that's one of the nice things about it. If you write a tool, the same tool will work for both the real-time data and the historical data, because of the way we create the archives and log the data in the archives.
But to return to your question of how you wade your way through this: we are talking about literally millions of pieces of data in any system that's a bit more complicated than my laptop, and so that's a daunting problem. The way to think about it is to imagine the data arranged along two virtual axes, if you like. The horizontal axis describes functional areas: my application, disk, CPU, interrupts, memory, and so on. The vertical axis describes the level of detail. So you might have summary data, then summary data across an array, then data aggregated at the node level, and then data aggregated at the rank level or the thread level within a node, with progressively more detail along that axis. Now, most performance tasks involve taking a shallow stripe across all, or a large number, of the functional areas. This is the summary data. What you're really looking for here is very crude thumbprints of anomalous behavior. Does the disk subsystem look kind of like it does when the system's running well? Does the paging rate look about what I expect it to be? Without any more detail than that, that gives you early warning of anomalous behavior in those areas, and it also helps confirm a hypothesis that nothing abnormal is happening in one of those areas. So a very small amount of data will usually get you that summary. The more insightful data comes from taking quite narrow but deep slices through that two-dimensional array, in which you take a functional area, or part of one, and because you believe or suspect that there's a performance problem, you dig very deeply into that area. So you end up with a combination of this data.
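The two-axis idea can be sketched in a few lines of Python. This is a toy model, not PCP code: the metric names are real-looking PCP metric names used purely for flavour, and the depth levels are invented for illustration.

```python
# Each metric is tagged with (functional area, level of detail),
# mirroring the two virtual axes: 0 = summary, higher = more detail.
METRICS = {
    ("disk",   0): "disk.all.avactive",
    ("disk",   1): "disk.dev.avactive",
    ("memory", 0): "mem.util.free",
    ("memory", 1): "mem.numa.util.free",
    ("cpu",    0): "kernel.all.cpu.user",
    ("cpu",    1): "kernel.percpu.cpu.user",
}

def shallow_stripe(metrics):
    """Summary data across every functional area: crude thumbprints."""
    return [name for (area, depth), name in metrics.items() if depth == 0]

def deep_slice(metrics, area):
    """Every level of detail for one suspect functional area."""
    return [name for (a, _), name in metrics.items() if a == area]
```

A first pass looks at `shallow_stripe(METRICS)` for anomalies; once an area looks suspicious, `deep_slice(METRICS, "disk")` pulls everything known about just that area.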
So although there's a huge amount of data, the amount actually required to do most analysis work is nothing like as large as the whole set. It's usually a small set. The tricky part is that it's not the same set for each customer, each application or each platform; it varies over time and between clients. All of the data is useful to somebody, some of the time, but there's no situation in which all of the data is useful to anyone at any one point in time. As for the things we've added that let us deal with the volume of data in a slightly more expeditious way: we have a very powerful inference engine that can consume very, very large amounts of data, evaluating predicates that describe goodness or badness in terms of overall performance. These can be used to very quickly raise alarms, or tell you how frequently something happened in the historical or real-time data, and to make operational decisions, alert you, and allow you to try to remediate bad situations. Visualisation helps too: the 2D visualisation we talked about is useful as a graph plotter, but there's even more insight to be had from 3D visualisation. We did a lot of work on this at SGI. Unfortunately, those tools tended to use libraries which were proprietary to SGI, and porting them to their open source equivalents has become a bit of a slog, so we haven't made as much progress there as we'd hoped. But the potential is there for 3D visualisation which is much more insightful, much like the scientific or HPC use of visualisation for providing insight into very large data sets.
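The inference engine Ken mentions pairs predicates over performance metrics with actions in a rule file. A minimal sketch of the general shape (the threshold, interval and message are invented for illustration, and the exact syntax should be checked against the documentation for your version of the engine):

```
// hypothetical rule sketch: evaluate every 30 seconds
delta = 30 sec;

// raise an alarm if any disk has stayed busier than 90%
// across the last three consecutive samples
some_inst (
    all_sample disk.dev.avactive @0..2 > 0.9
) -> syslog "disk saturated";
```

Rules like this are what make it cheap to run continuously over both live data and archives, raising alarms or counting how often a bad condition held.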
And logging and retrospective analysis is another thing that really lets us deal with the volume of data: being able to look at and process today's data and compare it with yesterday's data, or last week's data, or with the previous version of the application running a standard data set or executing in a typical window of the day's processing cycle. Those things allow you to identify what has changed, and it's often the things that have changed or are different which give you the most insight. So there's a fair amount of assistance there, and some automation, but it does require the insight of an analyst to make informed and intelligent decisions about what to look at.

So can you give us a real-world example where someone was able to use PCP to find a major performance bottleneck and resolve it?

I'll give you two examples of where PCP has been used gainfully to provide insight that probably would have been difficult to observe otherwise. One was a very large MPI application deployed on what was, at the time, a large SGI NUMAlink machine. I don't remember the exact number, but it would have been 128 or 256 CPUs connected via NUMAlink with global shared memory. All of the MPI traffic was running in shared memory, so there was no networking interconnect and very low latency in the MPI calls. Now, this particular code would run very, very well, and then there would be periods in which the performance would be really poor. It would make what appeared to be very slow progress, and the run times would extend. The final insight was gained when the application was instrumented, quite simply, to export the major iteration count. We also exported data around the MPI calls that could possibly be blocking: the ones that required a synchronous response and rendezvous, that sort of class of MPI calls.
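The instrumentation described here (a major iteration count plus counters and timers around potentially blocking calls) can be sketched as a toy analogue in Python. This is not the actual instrumentation from the story or a PCP API, just the shape of the idea.

```python
import time
from collections import defaultdict

# Toy analogue of the instrumentation described: a major iteration
# counter plus count/elapsed-time stats around blocking calls.
counters = {"app.iterations": 0}
blocking_stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def timed_blocking(name, fn, *args, **kwargs):
    """Wrap a potentially blocking call, accumulating count and time."""
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    finally:
        stats = blocking_stats[name]
        stats["calls"] += 1
        stats["seconds"] += time.perf_counter() - start

def main_loop(steps):
    for _ in range(steps):
        counters["app.iterations"] += 1
        # stand-in for a rendezvous-style call such as MPI_Allreduce
        timed_blocking("allreduce", time.sleep, 0.001)
```

With these two exports visible side by side, a stalled iteration count that is *not* matched by growth in the blocking-call timers is exactly the signature that pointed the investigation away from MPI, as Ken explains next.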
And we also looked at some very high-level operating system data. What we could see was that when the application slowed down, the iteration count definitely went down, yet there was no noticeable increase in any of the MPI synchronizing primitives. What we did notice was a lot of disk activity, which was kind of strange, because the application was not supposed to be doing much disk I/O. Once we knew that, we drilled down even further into the disk activity, and it turned out the problem was related to flushing of file system buffers for some very large temporary files: a faulty algorithm for flushing those buffers was forcing physical writes to happen when they didn't need to happen. So it was ultimately a file system problem, but it manifested itself in application space, and it was only when we could see the file system activity alongside the application activity that we actually got insight into the problem.

The second example comes from the Aconex deployment that I've referred to, in which a Java application has been instrumented to count not only the number of times each method is called to service web requests coming from the web front end, but also to measure the service time for those calls. On another machine, running another operating system and the database product, they measure the service time for the database operations, and they have also instrumented the Java virtual machine that the application runs on. So they had three separate pools of instrumentation in addition to the platform ones, and they identified two classes of slowdowns which were completely different.
One was related to congestion in the database, associated with a particular query which was not often executed but had a very adverse locking behavior on a critical table, dramatically reducing concurrency in the database. The second problem they discovered was that very, very poor response time was often associated with major garbage collection by the Java virtual machine, which was running with a huge address space. That's another example in which, without being able to put all of the bits of data together and look at them in a time-consistent fashion, you wouldn't have been able to gain that sort of insight.

So here's a question I like to ask other open source developers, just because it's a particular interest of mine: what version control system do you use, and why?

So we started with RCS. You have to remember this is a long time ago. Then we migrated to some internal tools that were being used by SGI to manage their source trees, which were based on RCS but had a series of frontends around them. And then, when we embraced open source and changed from IRIX being the primary operating system to Linux being the primary operating system, by tracking the Linux community we ended up Git-based. So the trees are Git-based at the moment, and that works well for us with our distributed developer environment. Frankly, having lived through many iterations of them, I think they're all largely similar. They all have idiosyncrasies, strengths and weaknesses. How disciplined the users of the source control system are about what they choose to put into a single commit, and how well they document it, seems to have more to do with the value of the source repository than the mechanics of which particular bit of software is managing it. I think that's probably the best answer I've got.

That reflects quite a bit of wisdom there. Okay, so Ken, we're going to wrap up here.
What is the website, and where are the mailing lists and the information for downloading PCP?

Okay, so SGI continues to act as the host for the project, and their open source repositories, mailing lists and so on run off a host whose name is OSS.SGI.com; PCP can be found below that at /projects/pcp. You'll find links there to join the mailing list, and to the mailing list archives. Binaries and source versions are maintained there and are available. If you are using any of the standard Linux distributions, they all carry versions that are not the latest, but not old and mouldy either, so quite usable starting points can be obtained from the standard repositories for Debian, Red Hat, SUSE or Ubuntu.

Well, Ken, we thank you for your time. We didn't necessarily start at an ungodly hour for you, but we're rapidly approaching one. This was great stuff, quite fascinating, and we really appreciate your time. Thanks.

Thank you once again for the opportunity to speak about my passion.

Thank you. You have a good evening.