Okay, let's begin. Good morning. My name is Mike Christofferson, and I'm the Director of Product Marketing at Enea. I'm presenting a proposal for a clock cycle-based performance measurement system. The idea was originally conceived by one of our engineers at Enea, Daniel Bornaz, who couldn't be here, so I'm presenting it for him. This is a work in progress; we've really just started it. The idea is to come up with a very simple, easy-to-use tool for measuring time between discrete events anywhere in the system, whether in the kernel or in user space. We want it to be simple, easy to install, to need very little configuration, and to get directly to the heart of the matter: measuring the time, in clock cycles, between point A and point B somewhere in your code. So, let's get started. First, let's talk a little about performance and profiling. We're interested in time relationships: finding execution hotspots, characterizing the behavior of the system through time-based analysis between events, identifying timing issues, and building statistics about the time between discrete events when they repeat in the system. I'll say more about what that means when I get into the proposal. Profiling, I think we all know, is software performance appraisal: runtime checking, program checking, static code coverage, all of these are profiling of your application in some sense.
Runtime code coverage, logic analysis, and performance analysis, and there are many kinds of performance analysis, as we know: system resource utilization analysis, cache hits and misses, event tracing and logging with tools like ftrace and LTTng, CPU utilization analysis with tools like OProfile, memory utilization analysis, per-program, per-thread, and per-function analysis with tools like Valgrind, execution flow analysis, runtime error identification. There are a lot of different ways of doing performance profiling; Perf is another example we could talk about. What's missing is a tool that very simply provides direct time measurements, in clock cycles, from point A to point B, so you can get to the answer immediately without exhaustive trace analysis, without pulling up tools to upload the traces and analyze them. The idea here is something that is easy for the programmer to use and deploy very quickly, without having to read a hundred man pages of documentation to figure out all the configuration options. It's really just about the time between discrete events, or a chain of events. Part of the idea is to provide a tool that is not just easy to use, but that can help check program correctness against its specification without too much overhead, so that as you're developing your code you can quickly and effectively see what times are actually elapsing between events. It's primarily for the development stage, as part of verifying your code to see whether you're close to your specification. And again, the emphasis is ease of use. We also want it to operate in user space, in kernel space, or across both.
In other words, to time events between, say, an event in kernel space that ultimately triggers an event in user space. And we want it to be seamless, so that we don't need separate tools, or unusual configurations, for kernel space versus user space. We also want to measure the frequency of events: not just the time between two or more endpoints, but how often each of them is being hit. As for measuring performance, these are the usual suspects; we all know what they are. They have different characteristics, setups, and impacts. I don't want to go through this entire list, because all of these tools are very useful, especially when it comes to timing measurements, and especially LTTng and ftrace in the kernel. ftrace is primarily for the kernel, but LTTng with UST can be used in user space as well, and it provides very powerful profiling and timing information. It's also a complicated setup, and it gives you much deeper kinds of analysis when tracing, because you can tag certain events or information in the traces. But we're talking about a tool that does one thing and nothing else: simply measure the time between events, so the programmer can get right at that information online, while the system is running, without a lot of reconfiguration. So let's talk a little about time measurement. There are software timers, and there are hardware timers.
Hardware timers are usually CPU dependent, which means that whatever solution you come up with, if it uses a hardware timer, it will have a different implementation on different architectures. Our proposed solution is portable, because we're going to abstract away the hardware timer access. What we want is high resolution, accuracy, very low load on the CPU, and portability. Resolution is about the smallest unit of time the timer can distinguish, whether that's nanoseconds, microseconds, or, in what I'll be talking about, clock cycles. Accuracy is how well the clock actually keeps time. Resolution and accuracy are different things, especially on a multi-core device: there have been issues with reading clock counter registers and having them completely synchronized across cores. So resolution and accuracy mean two different things. Hardware timers, which are what we're going to focus on in our solution, are CPU specific, efficient, and usually very accurate, notwithstanding the multi-core issues I just mentioned; newer multi-core devices are starting to solve that cross-core accuracy problem. They usually require kernel support. Our first implementations will be on ARM platforms. ARMv7-A/R and ARM11 processors have a Performance Monitoring Unit that counts the CPU events a profiler uses: statistics on the CPU and memory, a clock counter register, overflow and interrupt generation, all the support that a lot of other performance profiling tools use.
In our case, since we're focusing purely on time, we're just going to use the clock counter register, the CCNT, on ARM. And although I didn't list it, we're going to be implementing this on the Cortex-A15 as well. If we look at x86 platforms, we're going to focus on the Time Stamp Counter, which is present on all x86 architectures. x86 also has High Precision Event Timers, but those involve interrupts, and we don't want the overhead of interrupts in our time measurements; so we intend to use the Time Stamp Counter on x86. That's the preface; now let's get into the actual solution: end-to-end time, measured in clock cycles, between discrete events. We chose clock cycles rather than trying to resolve the time, because clock cycles let you do apples-to-apples comparisons of, say, the same application running on two different pieces of hardware, or under various scenarios, looking only at the cycle counts; you can then do your own translation into time and performance based on the cycle count. It also makes the profiling tool easier to build, with fewer calculations in the profiler. So, an almost perfect time profiling system: what do we mean by that? This is our list of requirements for what we thought would be a useful, interesting tool that could complement existing tools. Accessible in both kernel and user space. An easy-to-use instrumentation API, restricted to just a very few functions; again, we're not trying to solve every performance profiling problem there is, we're trying to do something that is very core and basic to a lot of program development, in a very simple, easy-to-use way. And we will, of course, make use of hardware support on platforms that offer it.
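To make the point-A-to-point-B idea concrete, here is a minimal sketch of the measurement pattern. It's illustrative only: reading CCNT or the TSC requires privileged, architecture-specific code, so this Python sketch substitutes a monotonic nanosecond clock as a stand-in for the hardware counter, and all the names are invented for the example.

```python
import time

def read_counter():
    # Stand-in for reading the hardware cycle counter the talk proposes:
    # CCNT on ARM, the Time Stamp Counter (RDTSC) on x86. From a script
    # we substitute a monotonic nanosecond clock instead.
    return time.perf_counter_ns()

def measure(work):
    """Return (result, elapsed counter units) between point A and point B."""
    start = read_counter()                   # point A
    result = work()
    return result, read_counter() - start    # point B

result, elapsed = measure(lambda: sum(range(100_000)))
print(result, elapsed >= 0)
```

The real tool spreads the two read points across your code via instrumentation macros rather than wrapping a callable, but the arithmetic is the same: two counter reads and a subtraction.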
Initially we're going to focus only on processors that have that support; the idea could possibly be ported, with software timers or something, to processors that don't, but we're not focusing on that now. I think the key thing about our proposal is how it works: the model, and the kind of data it provides. We'll provide statistical and time computations on the fly, unlike most tracing tools, where you just get the trace log and then have to go in and analyze it to look at timing relationships between events. We're going to calculate the time between events right on the fly, and compute statistics like the average time between those events as the system keeps running: minimum time, maximum time, total elapsed time, and so on, all done on the fly. So if you're looking at the timing relationship between point A and point B, at any time you can just type a simple CLI command and get that data right there. Not a lot of complication in the way we intend it to be used. Yes? Not at this time, but I'll address that later. We're starting out very simply, with a very simple CLI interface; I'll talk about that a little later. But because there will actually be a wealth of data in the system, one could build more sophisticated display mechanisms on top of it. We're just getting started with this, so we're not going to try to solve every problem in the world. If the idea takes off and looks interesting after we've tested it and made sure it works, then of course it's going to be open source; we'll put it out there and see what people want to do with it. But before throwing something out that might be half-baked, we want to build a prototype and find out what its own overhead is, because that's important.
You have to understand the overhead of the tool in order to understand the use cases it fits. And then perhaps more sophisticated analysis of the collected data can be built as time moves on; we'll see about that. We'll have a minimal trace option: we're going to keep some trace data in the system, used mainly to perform the on-the-fly computations, but that trace data could also be extracted, just as with a lot of other tools, for post-processing analysis. We're just going to provide the trace; we won't do much with it ourselves beyond the on-the-fly computational analysis. You'll see what I mean when I start talking about the implementation. Those were the "wills", the absolute requirements; now the "shoulds". As we develop this, it should have low overhead and low performance impact. It should be easy to port to new platforms; we think we can make it that way. It should have a very simple instrumentation implementation, and by implementation I don't mean the user APIs that define endpoints between events; I mean the internals, so that it is very easy to use, with minimal or almost no configuration in the system. That's an ambitious goal, but I think we'll see that it can actually be done. Continuing through the requirements: it should offer both continuous profiling and a single snapshot of a single event, a one-shot. If you're setting up something very specific, maybe debugging, then instead of letting a lot of data flow by, you just want to take one shot and look at it. It could be a useful tool as you're developing code in your system and want to look at performance as you go.
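The on-the-fly statistics and the one-shot option described above might look something like the following sketch. It is purely illustrative (the real tool is a kernel module, and every name here is an assumption), but it shows how min, max, average, and count can be maintained incrementally as each cycle delta arrives, so that a query simply reads the current values instead of post-processing a trace.

```python
class PointPairStats:
    """Illustrative on-the-fly statistics for one pair of profiling points.

    Min, max, average, and count are updated as each cycle delta arrives,
    so a CLI query at any moment just reads the current values; nothing is
    recomputed from a trace. one_shot=True keeps only the first measurement.
    """
    def __init__(self, one_shot=False):
        self.one_shot = one_shot
        self.count = 0
        self.total = 0
        self.min = None
        self.max = None

    def update(self, delta):
        if self.one_shot and self.count >= 1:
            return  # single-snapshot mode: ignore everything after the first hit
        self.count += 1
        self.total += delta
        self.min = delta if self.min is None else min(self.min, delta)
        self.max = delta if self.max is None else max(self.max, delta)

    @property
    def avg(self):
        return self.total / self.count if self.count else 0.0

continuous = PointPairStats()
snap = PointPairStats(one_shot=True)
for d in (120, 95, 140):      # hypothetical cycle deltas between pp1 and pp2
    continuous.update(d)
    snap.update(d)

print(continuous.min, continuous.max, continuous.avg, snap.count)
```

Because each update is a few comparisons and additions, the per-event cost stays constant no matter how long the system runs, which is what makes "query at any time" cheap.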
Again, we view this primarily as a development tool. It should be based on a very simple architecture, so that it's easy to implement, easy to understand, and therefore easy for others to go in, make modifications, and add to it. So even though we're going to have something that looks like a trace, our intent here is not to build just another tracing mechanism; the primary focus is on-the-fly calculation of time. Now, the architecture of the implementation. The actual profiling is done in a kernel module, and I'll show you an expanded view of this in a moment. There is a simple header file, the same one for both kernel and user space, that applications use; it provides the simple APIs for defining the relationships between the various endpoints in the code. I'll explain the whole schema in a few moments. The way you would use this: you have a module or a program, and it could be in the kernel, so you instrument it, include this header file, and that gives you the APIs, which are basically just macros that call into the kernel module. Very simple, really easy to set up. All of the information is kept in the kernel module, which is why we can actually use this from both kernel space and user space and have one consistent, coherent data store. That comes with some penalties, as we shall see; this tool isn't designed to solve every problem, but, getting a little ahead of myself, it can solve a lot of them. And then we want a very simple user space utility, accessible from a CLI, to set up the data store and then manage the profiling activity, mainly by extracting the information being collected. So now let's look at this in more detail.
We have the profiling kernel module, which contains the data store in memory, and what we call the statistics data processing unit: basically the code that does all the on-the-fly computations. This kernel module looks like a pseudo-device driver, so it can be accessed from either kernel or user space. Again, whether you're in the kernel or in user space, you just use the simple APIs provided by the header file, which call the driver directly if you're in the kernel, or through the pseudo-device if you're in user space. We're going to have a hardware abstraction layer over the performance registers; that is of course platform dependent, but it makes the rest of this relatively portable. And again, I think what's interesting here is that since we're deliberately limiting the scope, we're not trying to solve all the problems you need other tools for, we're just interested in time, the hardware access is relatively simple and straightforward. Relatively, I should say. Then of course we'll have profiling information retrieval and display: the program that sits in user space, which you talk to in order to access the performance profiling information held in the kernel module. That's what it looks like from a block-diagram point of view. Just a little note about the instrumentation API: as I mentioned, the same APIs work for both kernel and user space, based on macros. No libraries, no static or shared libraries between modules; just macro calls that you insert into your code, which call this pseudo-device driver with some parameters that you supply, and that's it.
The communication with the kernel module is non-blocking: as we shall see when we look at the data structures, there's nothing in the way we manage the data that requires any kind of blocking due to resource contention. Every call in is non-blocking, which is almost a requirement, because otherwise the overhead could be severe. We'll see how that works. To further the discussion, we have this concept of profiling points. It's our own term for this, though I think it's used elsewhere: profiling points are the actual points in your code between which you want to measure time. Each profiling point has a numeric identifier, a group identifier, and a mask indicating the statistical operations to be performed, whether min and max time, average time, or frequency; those are just some of our initial ideas. So you get a little selectivity in what you profile, but it's all about time: time between events, average time, min time, max time, and the frequency at which the events are occurring. You can select any or all of those options as part of a single call. Literally, for profiling purposes, there is one single call that you insert into your code, with the parameters that identify the profiling point. You'll understand this better when I get to the actual implementation: how the data looks, what these profiling points look like, and how we track time and map events sequentially to each other. But first, let's talk a little more about the implementation in kernel space.
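Before going on, here's a rough sketch of the single-call model just described, together with the per-point circular buffering and occurrence matching that I'll walk through shortly. It's an illustration under stated assumptions: the real tool is a kernel module driven by C macros, and the flag names, buffer depth, and function names here are all invented for the example.

```python
import time
from collections import deque

# Invented flag names: the mask selects which statistics to maintain.
STAT_MIN, STAT_MAX, STAT_AVG, STAT_FREQ = 1, 2, 4, 8

class EventGroup:
    """One circular timestamp buffer per profiling point in the group."""
    def __init__(self, num_points, depth=64):
        self.bufs = [deque(maxlen=depth) for _ in range(num_points)]

    def hit(self, point_id):
        # Stand-in for reading the hardware cycle counter (CCNT / TSC).
        self.bufs[point_id - 1].append(time.perf_counter_ns())

    def deltas(self, a, b):
        """Pair the k-th hit of point b with the k-th hit of point a."""
        n = min(len(self.bufs[a - 1]), len(self.bufs[b - 1]))
        return [self.bufs[b - 1][k] - self.bufs[a - 1][k] for k in range(n)]

groups = {3: EventGroup(num_points=2)}   # data store pre-built at "load" time

def profile_point(group_id, point_id, mask=STAT_MIN | STAT_MAX | STAT_AVG):
    """The single instrumentation call inserted into your code.

    The mask is accepted but ignored in this sketch; the real module
    would use it to decide which on-the-fly computations to perform.
    """
    g = groups.get(group_id)
    if g is not None:        # if the store doesn't exist, the call is a no-op
        g.hit(point_id)

profile_point(3, 1)          # pp1, e.g. in an interrupt handler
profile_point(3, 1)          # a second interrupt before user space runs
profile_point(3, 2)          # pp2: user space handles the first event
profile_point(3, 2)          # pp2: and then the second
print(groups[3].deltas(1, 2))   # first-with-first, second-with-second pairing
```

Note how the two pp1 hits before the first pp2 hit still pair up correctly, because each point keeps its own ordered list of timestamps; that is the queuing case discussed below.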
The kernel module is loaded at runtime, on demand, so again you don't have to pre-configure anything; you just load it in. When you load it, it sets up the basic data store. The size is configurable, and it has to do with how big the trace buffers are, the ones we actually use for the computational analysis; that is all defined at load time. It's a single-instance implementation, and again there's the hardware abstraction layer. Kernel-to-kernel profiling can be done, but really, there are so many other good tools for that, ftrace for one. Our interest here is not so much in profiling between events inside the kernel as in profiling between events in the kernel and in user space: something happens in the kernel, and you want to see what it causes to be invoked in your application, and what that time, or that frequency of events, is. That's the primary use case. The implementation in user space is virtually the same as what I was just showing you. Later, we're going to implement offloading of the trace buffers to disk, so that one can build more sophisticated tools on top of it, but right now the primary use case is simple, quick access to the time data on demand, without a lot of extra hassle; all that other stuff is possible, though. As I mentioned, it's a pseudo-device driver, so it's a very simple implementation, and of course it's accessible from user space as a pseudo-device.

Now we're getting into the real idea. This tool can measure not just the time between two events, but between multiple events in a chain. In this example, profiling point one hits, and there is a direct causal link from that point to a later event, profiling point two, and so on. There has to be a direct causal relationship through the flow of events between all of those points; in other words, you can't have profiling point one hit, then three, then two. You have to understand your application in order to use this. I'll mention now something we'll see again later: this concept works best when each of those profiling points is hit from a single thread, not from multiple threads. Why? Because if multiple threads are hitting them, you're not actually looking at the time sequence you think you're looking at. But if each one is controlled by a single thread, so there's only one path to it, then there is a direct one-to-one causal mapping that is absolutely in sequence and cannot be reversed. And a lot of programs are like that: something happens here, causes a transformation there, then something somewhere else, with a direct one-to-one sequential causal link between all those points. Now, each of these profiling points, one, two, three through n, has an event group ID that defines this sequence, and you can have multiple event groups: say, two endpoints in one group and three endpoints in another. You have to identify the group a point belongs to in order to define the sequence of points you're trying to measure. And for each one, the idea is very simple: when profiling point one hits, we mark the clock; when the next point hits, we compare against that entry and calculate on the fly whatever it is we're trying to calculate, the time difference, the average, the min and max, keeping the count, and so on, all the way through the chain, including, once we reach profiling point n, the time it takes to come back around to the next pass through the sequence.

Now, there's something really important here. In a flow, this doesn't work if events come out of sequence, like pp3 before pp2 in a single flow. But consider most kinds of processing: there can be queuing or buffering effects. Say you put profiling point one in an interrupt handler, and profiling point two at something that eventually gets triggered in user space. What happens if a second interrupt hits before the first instance of profiling point two is reached? This happens often in data-transfer kinds of flows, like in communications devices: interrupts fire, packets come in, packets flow through the whole system. The sequence of an individual packet is never out of order, but you can have multiple hits on pp1 before the corresponding hits on pp2 and pp3. So how do we manage that? It's actually very simple, and this is where the trace buffer comes in. If you look on the right, there is a trace buffer per event group, one for each group, covering pp1 through ppn, and we keep the clock measurement for each instance of each hit. Every time pp1 is hit, we add its timestamp to a circular buffer; on the left, the pp1 list pointer is the pointer to the next pp1 slot, so we're just filling the circular buffer downward, and we do the same for pp2 and for ppn. So if everything is sequential in a flow from pp1 through ppn, all we have to do is match up the first pp2 event with the first pp1 event, the second pp2 event with the second pp1 event, and so on. In other words, we're mapping the occurrences of the events so that everything in the flow is in sequence. Going back to the earlier picture: suppose I hit pp1 twice before the first pp2 hit. Because I'm logging all of those hits, I can map the first occurrence of pp2 to the first occurrence of pp1, and then the second occurrence of pp1, which triggers the second occurrence of pp2, to that one. I'm just matching the occurrences. Why does that work? Because I know that this flow always executes in exactly this sequential order. Think about it a little, and you'll see it works. Given that, we can do a really nice job of mapping time across multiple events through your system: as long as the rules are not broken, as long as the sequence of profiling points is guaranteed by the way the program is designed, you can handle the complicated flows where things don't literally happen as pp1, pp2, pp3 all before the second occurrence of pp1. If they did, you wouldn't need all this buffering, but that's often not the case in a complicated system. So that's how we maintain coherency between the data and the profiling points. And notice that the structure on the right really is a trace buffer containing detailed information, just like any other trace. We could extract it and apply other tools to it for more fine-grained analysis of individual events in the sequence order we're talking about, looking at the relationships between particular occurrences of pp1 and pp2, even building histograms of what's happening over time. Our initial tool just computes average, min, max, frequency, and elapsed time; that kind of information is really useful to have, and you're not
looking for a lot of other, more complicated relationships. We're providing a very simple way to get at it, but you can do much more sophisticated things. The user space utility is basically going to be a single binary with installation scripts, perhaps with some customization for your user space. It provides the interface for a CLI, or you could build other interfaces onto it for other kinds of access, whatever you want to do, but we're starting with a CLI. It has direct access to the pseudo-device driver, which contacts the kernel module, pulls out the information, and displays it for you right on the spot. So again, no complicated configuration is necessary in the system: no compiler flags to set, not much to do at all, actually. It will have certain other functions, like enabling and disabling profiling, so that you can turn it on and off; that's a common, simple feature we want. It can extract the statistics for an individual profiling point, for an event group, or for all event groups, any combination of the amount of data you want. Remember that the trace buffers, this data store, exist per event group, so if you want the trace data it can get very large. But the thing is, you pre-configure the size of these things: you pre-configure the size of the circular buffers, and you pre-configure how many event groups you want, so that you don't run into collision problems while setting this up. When you create the store, you say, "I want n event groups," in advance, and you don't have to create any data structures after that. Then an individual profiling point comes in and says, "I'm event group three, profiling point one," and the data store is already pre-built, and it just starts working. So again, you don't need a configuration step that dynamically creates event groups; just create a store big enough to hold all of the profiling points and event groups, load it once, and it starts working. If you actually have profiling points in the code but the kernel module hasn't been loaded, the code just skips over them, because the module doesn't exist; that's fairly straightforward to manage.

One other thing I forgot to mention before. If you think about this, and I was trying to find a way to depict it but couldn't, there can be multiple accesses into this data store, into the kernel module, in a multi-threaded environment, from different event groups: maybe something got suspended right in the middle, and then a thread for another event group comes along and starts working on the store. But if you look at the way the data structures are set up, there's actually no need for any kind of special lock, because no two profiling points or event groups are accessing the same data at the same time; each event group and each profiling point is separate. So there's no collision between someone updating a writable entry and someone else reading it out of sequence. It actually works; you'll have to look at this a little and think it through. That means we don't have special locks, and that no operation on the way in is a potential blocking operation, because at most you're reading something that could not possibly be re-entered while it's being written, again because of the time sequence of events. If the sequence between pp1, 2, 3, 4 and so on is inviolate, if everything between those points always happens in that order, then nothing gets re-entered here. That's an important point.

So, in summary: it's a simple-to-use tool for measuring time between multiple endpoints in a serial, causal sequence of events that you've defined. And as I mentioned, it's driven by very simple CLI calls; our initial implementation will probably be just a handful of calls to get information out of the data store, and the information is immediate. The average time between these events over a long elapsed period can be very, very useful, it's very quick to implement, and, as I mentioned before, almost zero configuration is required. We don't have a concept of thread awareness. The concept might still work if multiple threads are hitting some of those profiling points, but you have to think really carefully about that; often it won't work, and we haven't solved that problem just yet. So thread awareness is something we might want to start thinking about, because you want to correlate which thread caused something to happen in another thread; you'd have to correlate the threads, not just the hits on the profiling points. That's a complicated problem, so we're not trying to solve it; this works in a single-threaded environment, where each profiling point is managed by a single thread. Yes, you're going to have a problem with multiple threads there; that's right. Like I said, this doesn't solve all problems. Once we get this working, here's another use case that actually works pretty well: an interrupt handler. There's only one source for it, the hardware, so it's not being re-entered; hardware interrupts are just queued up by the hardware. That's a case where you can mark a point there and hit something that happens way off in user space a long time later. So that is an issue, a very good question; we haven't solved that problem, but again, that doesn't mean it isn't useful for a large variety of problems. As for overhead: we don't know what the overhead is yet. We're a little leery that we're using a syscall-type approach from user space, which, you know, has a fair amount of overhead. When we implement this, we're
going to measure the performance overhead, of course, and see where it sits. My own belief is that this tool is not good for measuring function entry and exit, where the function doesn't block for long periods, because the tool's own overhead might be enough to really skew the results. But again, remember, we're not trying to instrument function entry and exit and do all the profiling things that Perf and those kinds of tools do. We're looking between discrete events in the system, and if you're really looking at complicated systems, those events can be microseconds or even milliseconds apart. When the time scales are large compared to the overhead, which is more than, say, an LTTng tracepoint, but still small enough, then it works very well for larger-scale timing issues. And I'm not really going to argue with you about whether it's necessary or not; you believe it is, and that's fine.

[Audience member] Setting aside the implementation, which we could discuss for ages as well, I want to challenge one fundamental assumption you're making. You've basically taken the cycle counter, and you seem to believe that there is a direct correlation between the cycles in the system and what the processor and the pipeline are doing, and that assumption is fundamentally broken; that's number one. And number two, you mentioned that comparing clock cycles is like comparing apples to apples, and again, that assumption is not really valid in a modern system. We could just...
Yes, yes, you're actually right there: you don't have a direct relationship to time. But still, when you're looking at your application, it gives you very useful information by looking at the clock cycles between events. So every time I said "time," I didn't really mean time, I meant clock cycles; sorry. We're prototyping this now, and the tool might be useful for a wide variety of development problems, to see, relatively, based on clock cycles, how much time passes between events. And that's a good idea; I'll mention it to Daniel. He came up with this idea because it was really easy to implement and very easy to understand, so yes, we'll look into that; it's a good suggestion. This is just our first entree into this. Okay, well, thank you. They're telling me to stop; I think we've just run up against the time. Very much appreciated. If you have any further questions, or you want to talk about this some more, Enea has a booth in the main area; come by and talk to us about this. We do other things too: we have a commercial distribution of Linux that we offer, lots of interesting real-time extensions, lots of interesting work that we're doing. So thank you.