Okay. Shall we start? Hi. I'm Mark Wielaard. I currently work for Red Hat, and they actually pay me to hack on SystemTap, which is fun. Some of you may know me from some of my Java work, and if you were thinking, oh, Mark is going to talk about Java, that's going to be a bit disappointing. There will be one example at the end using SystemTap with Java, but all the rest is boring old-school languages.

The talk itself is in two parts. One about SystemTap itself: why do we need it, what can it do, some examples. And the second part is also about SystemTap, but about what we need besides SystemTap to make it really useful, and what you could do with your programs to make SystemTap and observability on GNU/Linux better. And then there will also be a little demonstration, indeed with Java, of how you make something like a Java runtime more observable with SystemTap. Well, it's not really a circle, but kind of a circle.

So why do we want to have something like SystemTap? If you are looking at what your system is doing, there are a lot of nice tools. You can do profiling, with just top to see what is running on the system, or using OProfile or the new perf tool. You can use tracing tools: strace to trace your system calls, ltrace for library tracing, the new ftrace tool to trace kernel function calls. And of course, once you have figured out what is going on, you can drop into the debugger and really inspect it. The problem is that these are all separate tools. Profiling tools often work system-wide, while tracing tools work one application at a time. And with debugging you often just stop your program to inspect what is going on. With SystemTap we wanted all the advantages of these tools without any of the disadvantages. Of course, that also means with SystemTap you give up some things you can do with the other tools. But we hope that with SystemTap you have at least one tool that keeps you going for a while without having to switch between these things.

So, the tracing tools. What is nice about tracing tools is that you can see what is happening while things are running. It is often faster and nicer than printf debugging, because it is fully automatic printf debugging. You get a quick overview of your code flow. But they are specialized tools: you can trace system calls, or library calls, or kernel function calls. And there is often only limited filtering, which means you quickly get too much information. And when you try to grep it out and your system is really busy, you might actually end up changing the system you are observing.

With profiling, the nice thing is that it also runs while your system is running. It samples, which means you get statistics about what is going on. And you can often see system-wide what is going on, whether it is in a library, in your kernel, or in your program. But it is often limited to time-based sampling; it would be nice if we could sample every interesting event. And often it produces a large data dump: you profile for a while and then you have all this data, which you analyze after the fact. Although there are really nice after-the-fact analyzers for the various profiling tools.

And with debugging we have the full context of what we are observing. You can access the variables and the parameters of the functions, you can look at memory and registers, you can get a backtrace. You even have conditional breakpoints, so you could run a program and only stop it if some context is interesting.
The limitation is, again, that it stops the program under inspection. Which is nice if you want to sit down and look at what is really going on, but not so nice if a problem only occurs on your production system while people are actually using that system. You would rather just look at what is going on and not stop a server. And with debugging, again, it is one program at a time, not system-wide.

So with SystemTap we kind of made a mix. It is unobtrusive (a difficult word). Nonstop, so you can look at your system without halting it, or at least without holding it up for too long. It is system-wide. You can monitor multiple events, both synchronous ("this happens now": a function is called, a function is exited) and asynchronous (based on time). And it is scriptable, so you can do in-place filtering (you don't have to filter after the fact) and you can collect statistics in place. It is not super powerful, but powerful enough.

And what we wanted was to make all this safe. So we enforce that it is nonstop and unobtrusive and doesn't use too much memory. SystemTap will allow you to monitor things, and if it notices your monitoring takes too long, it lets the system run. You can override that, but by default you are not allowed to do anything dangerous: use too much memory, use too much time, that kind of stuff.

So how does it look? It is a pretty simple setup. Basically you have an event and you have a handler for that event. An event can be something like a function entry, for example in the kernel, or a specific statement in a program being executed, or a timer that triggers. The begin and end of a script are events too, so in the begin event you can set something up, and in the end event you can create a report from all the statistics you collected. And there is the idea of aliases, so you can address multiple interesting events as if they were one. There will be an example of that.

A handler can do simple filtering on some conditionals in the context of the probe, and then you can immediately say: oh, don't do that, I'm done. There are simple control structures, foreach loops; you have associative arrays and statistical variables and you can loop over those. Of course, there are limits: SystemTap notices when a handler does too many steps and is slowing the whole system down too much. That is the default; we always want there to be no noticeable impact on the system, unless you explicitly say: yes, I want to use megabytes of memory and I know what I'm doing. Yes, sure.

And there are three kinds of variables. You have primitives (numbers, strings), you have associative arrays, and you have statistical aggregates, where you can store values and get the count, the sum, the average, the maximum, the minimum out of them. And of course there are some helper functions for actually logging something, printing, getting the time, finding the current process, that kind of stuff.

Well, I said I would show some examples. This is the simplest: you invoke SystemTap as stap. Either you give a script on the command line or you have a script in a file, and you can specify a target process if you want, which actually means: while this process is running, this script should run. And to see what kinds of events there are, you have stap -l to list them, and you can give it a pattern of events you want to see. If you use -L (capital L), you actually get all the context variables as well. So I'll show some examples.
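To make that concrete, here is a minimal sketch of a complete script in that style; syscall.open, timer.s and the begin/end probes are real events, but the script itself is just illustrative:

    global opens                        # associative array of statistical aggregates
    probe begin { printf("starting\n") }      # begin/end of the script are events too
    probe syscall.open {
        opens[execname()] <<< 1               # record one open per process name
    }
    probe timer.s(5) { exit() }               # asynchronous, timer-based event
    probe end {                               # report from the collected statistics
        foreach (name in opens)
            printf("%s opened %d files\n", name, @count(opens[name]))
    }

You would save this as, say, opens.stp and run it with stap opens.stp, or pass a one-liner inline with stap -e '...'; stap -l 'syscall.*' lists the matching events and -L adds their context variables.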
And I wanted to do this without any tricks, just with Fedora 12 installed. Fedora 12 is really nice because, for a couple of these contexts, it really helps to have debuginfo available, and the latest Fedora has a backport of the variable-tracking-assignments patches for the next GCC, so the debuginfo is really good. It has SystemTap nicely integrated and has the latest kernel. I did have to do some tricks. I installed the debuginfo beforehand, because otherwise it would take a bit long; SystemTap will tell you if it needs specific debuginfo files. I added myself to the stapdev group, which means I have elevated privileges so I can actually see everything. There are a couple of different groups you can put users in so they can only use some scripts, and there is an unprivileged mode in which you can only inspect your own programs. And I actually installed the Python and Java packages from the Rawhide version, because the versions in Fedora 12 didn't have all the probes that I wanted to show. I believe the Java package is actually going to be updated, and next week it will be in Fedora 12.

So let's see. I'm going to show that you can do some simple tracing. You probe a process, ls, say show me all functions, and just log the probe point where you placed the probe. And then you can run it over ls. Yeah, sorry, the resolution is a bit small. But what you can see is that it actually put probes on all the functions. This one is actually a function from an included file. And the nice thing is, this is also how you can specify the probe point: you can specify probes as they would be found in the source files. It can do that because the debuginfo has all the knowledge of how the binary maps back to the source files.

And to show that you can also get some context: let's only print the probed function and its parameters. Right. So here you see, again in a hugely big dump, that sometimes it cannot find all the context. For example, here for print_color_indicator it doesn't actually know what the parameters are. But for most functions it can actually see what arguments were given.

To show that you can also format this nicely, because it is just too much output: here we probe each function call, and on entry we add a bit of indentation for the current thread and print an arrow going into the function; on each return we decrease the indent and show an arrow coming back out. If we probe and trace in this way, you can actually see... ha. Now I can show you that it's also system-wide, because I forgot to actually give it a target process, but I can run ls here. Ha, and it saw the ls running on my system. The thread_indent function also prints the time delta, the process name, and the thread ID. So here you can see that you can use SystemTap as a fancy ltrace, but you can nicely format and filter all the results.

And, no, let's skip that one for a minute because we're running out of time. This is all process-based, but you can also trace at the system level: probe syscall.open and show all the files that are being opened on the system right now. So here we are doing system-wide probing. We probe... whoa, okay. You can see that gnome-settings-daemon is opening /proc/mounts a lot.
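The demo one-liners were along these lines; these are the standard tapset idioms (probefunc, $$parms, thread_indent, the filename variable of syscall.open), though the exact scripts on the slides may have differed:

    # list probeable functions in ls, with -L also their context variables
    stap -L 'process("/bin/ls").function("*")'

    # log every probe point hit while running ls once
    stap -e 'probe process("/bin/ls").function("*") { log(pp()) }' -c /bin/ls

    # print the probed function name plus its parameters
    stap -e 'probe process("/bin/ls").function("*") {
                 printf("%s %s\n", probefunc(), $$parms) }' -c /bin/ls

    # indent per thread to show the call flow, arrows for entry and return
    stap -e 'probe process("/bin/ls").function("*").call   {
                 printf("%s -> %s\n", thread_indent(1), probefunc()) }
             probe process("/bin/ls").function("*").return {
                 printf("%s <- %s\n", thread_indent(-1), probefunc()) }'

    # system-wide: which files are being opened right now?
    stap -e 'probe syscall.open { printf("%s: %s\n", execname(), filename) }'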
And sendmail opens /proc/loadavg every second. Interesting. What did I want to say? Oh, I just wanted to show that you can do system-wide tracing and get output. The problem here, of course, is that you quickly get too much output to see who is really doing something. The screen is a bit small, but the idea of this script is that you have some globals which are associative arrays, and for every virtual file system read and write, you add to the count of what each executable was reading and writing. And then you have a timer-based probe, which we don't really use for probing, but for periodically outputting all the data we collected: we go through the names in the writes and in the reads, we create totals, we print everything, and then we clean up. Oh yeah, every five seconds.

So, luckily the thing happened that I wanted to show: even on a very quiet system like this one, you can already see that by outputting all this data, stap itself ends up on top of what you're tracing. To help a bit with that kind of anomaly, you can use statistical variables, where each time you just put in the next number; and since we don't actually print out this global while running, the monitoring stays out of the picture. So we run this program, and when we quit it (I hope... yes), it actually prints out, for everything it monitored: the clock applet did two reads, the minimum read was a thousand bytes, the maximum is the same, the sum of course is then two thousand, and on average it did a read of a thousand bytes.

Okay. So what I showed is that with SystemTap you can do fancy scripting of all these events. But we need something to observe, and for that we need events that we want to observe, and we need to have some context. Most of what I showed was very low-level, and I think we have the low-level stuff nicely covered. You can have timers, you can monitor processes, signals, function calls, you can monitor the kernel, we're working on data watchpoints so you can actually watch data being changed, and we hope to have events for the performance counters in your CPUs. Especially for C and C++ based programs, the debug info is pretty good these days, so GCC provides us a lot of context. The trick is to get all these techniques also used by other tools. We actually hope that all the tracing, profiling and debugging tools will use some of what we're using, because then you get more people hacking on more of these low-level events.

But what we really would like is more high-level events. We have two systems for that. You can write tapsets; that's basically the aliases. We have a tapset for the virtual file system, for example: it defines probes, aliases them to the correct kernel functions, knows which arguments contain what, and sets up the context for you. And that's really nice, but what we would really like is that kind of context not coming from somebody going through all the source code of the kernel or a program you want to observe, but being in the source code itself, where the developers write their code and say: this might be an interesting point to observe; this is an event, and here's some context for it, which might be interesting if you want to know what my program is doing. And this is where you can help. You write programs, so.
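A sketch of that statistics idea: record per-executable VFS read sizes in an aggregate and only report at the end, so the monitoring itself produces no output while it runs. vfs.read.return is a real tapset probe; treating its return value as the byte count is the usual idiom:

    global reads    # execname -> statistical aggregate of read sizes

    probe vfs.read.return {
        if ($return > 0)
            reads[execname()] <<< $return    # record each read size, print nothing yet
    }

    probe end {
        foreach (name in reads)
            printf("%s: count=%d min=%d max=%d sum=%d avg=%d\n",
                   name, @count(reads[name]), @min(reads[name]),
                   @max(reads[name]), @sum(reads[name]), @avg(reads[name]))
    }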
In the kernel, we're making nice progress. There are actually two solutions; well, to be honest, the kernel markers are being deprecated and tracepoints are the new way to do it. SystemTap can observe them both. And of course there are a lot of tapsets already for the kernel.

For user space, we're trying to do a trick. There's DTrace, which is kind of similar in spirit to SystemTap. Unfortunately, it's not really suitable for GNU/Linux systems; well, that's basically just because the license is completely incompatible, oh well. But they had a great idea: they have statically defined user-space probes, which are markers you put in your program to indicate, here is something interesting. Actually, we had the same idea and did it slightly differently, and then we thought: well, that is not nice, because then we make people choose, do you support DTrace or do you support SystemTap? So we're actually discouraging people from using the SystemTap-specific probe macros and encouraging people to put DTrace probe markers in their source code. The implementation is completely different, but if you compile your program on a Solaris system, it will make those events visible to DTrace, and if you compile on a GNU/Linux system, SystemTap will see them. And we even provide a fake dtrace script that does the same things in your build. So there are actually a couple of programs, like PostgreSQL, which already have those; if you configure with --enable-dtrace, those probe points get compiled in and you can see them through SystemTap.

Where should you put these probe points? What are interesting events? Most of the time, it is where you would check a verbose flag and say: this is happening now. In those cases, it's often nice to also have a static probe point, because with a verbose flag, or if you use logging, you are controlling what the user sees, and often they know better: they know whether they want to see something in one run or on the whole system, and they only need some context.

One interesting thing was that when we implemented SystemTap itself, we had the same problem. Of course, stap has -v, and it parses the scripts and compiles the scripts, so it has a couple of passes. And I was adding the verbose output: oh, I actually want to know how much memory this is using, so I print out the memory, and the timings. And then suddenly it was: wait, I have a better tool for this. Because if you have these probe points, then we can provide the actual values, like the memory the process is using at that time, or the time a pass took, not just strings, so people don't have to parse your verbose output. And what was funny was that we added probes to each pass, so you can see stap start and stop each phase: the parse pass, the compile pass, and so on. What is interesting is that now we can run this verbose output system-wide. We just run our test suite, and because we have aggregates, we have statistics, you can run the test suite with a little script monitoring all the stap passes and then say: on average, pass three takes the most memory, and the argument at that time was this particular script.

Am I going too fast? Well, no, I'm on time. Okay. I'm going a little too fast. That just means you get to ask more questions.
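If you want to try this in your own code, a minimal sketch; myprog and the probe name loop are made up for illustration, but sys/sdt.h and the DTRACE_PROBEn macros are what this mechanism actually uses:

    /* myprog.c -- hypothetical example of a static user-space marker */
    #include <sys/sdt.h>

    int main(void) {
        int i;
        for (i = 0; i < 10; i++) {
            /* provider "myprog", probe "loop", one argument */
            DTRACE_PROBE1(myprog, loop, i);
        }
        return 0;
    }

On the SystemTap side, the marker then shows up as a .mark probe, and its arguments as $arg1, $arg2, and so on:

    stap -e 'probe process("./myprog").mark("loop") {
                 printf("loop: %d\n", $arg1) }' -c ./myprog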
So, another way to use this is to make things that aren't C or C++ or the kernel, like scripting languages or a language runtime, more visible. Actually, now I have time for a little story. It was kind of interesting, because we really wanted these DTrace probes to be completely compatible, so people could just put those probes in their programs and DTrace would see them and SystemTap would see them and the whole world would be happy. What was funny was that then we tried to see how Python was instrumented. And one of the sad things is that DTrace has actually forked into an Apple macOS version of DTrace and a Sun (now Oracle, I guess) Solaris DTrace, and neither of them actually submitted their probe points upstream. So we tried very hard to be compatible, and now we might actually submit our version upstream, and then they would still look slightly different. It's not a big difference, but the argument order and the precise probe names are different on macOS and on Solaris, which is kind of silly. Anyway, the nice thing in the Python case was that it was already kind of done; we just had to pick either the Solaris or the Apple version. And indeed, when you compile Python this way you get a couple of probes, of which function entry and function return are the most interesting ones. Again, we do some thread indenting to show the flow. The argument names are not that nice, but I believe argument two is the function name, argument one is the file name, and argument three is the source line. Yeah, that must be it. Well, we can just run it. And, yep, I didn't even do anything in Python yet, but as you can see, you can now actually see inside your Python program: which functions are entered when, and where they were defined in the source code. One thing we don't have for Python, but we do have for Java, is backtraces; that's sometimes a bit more work. But this means that you can now instrument and look into your Python program as if it were a normal C program. Well, at least according to SystemTap.

I should probably show that we can do the same with Java programs now. Java was also nice because, at least in the case of HotSpot, which was the Sun VM and is now under the GPL, they actually spent a lot of time instrumenting all the interesting parts. So you can now see when your Java program starts to garbage collect, or when it loads a class, or when it unloads a class loader. And the nice thing is that you can combine this with your other probes. So you can see: hey, when there was some swapping of memory going on in the kernel, does that correspond to my garbage collection?

Maybe I'll just do one example. So, in this case, what's nice is that since SystemTap can look through the whole stack, you can get an interesting backtrace. What I actually did was: when the JNI (Java Native Interface) GetArrayLength function is called, give me a full backtrace at that point. And what you can see is that it can not only show Java functions, method calls, but it also knows whether they are JIT-compiled. Well, this was just a hello world program, so probably nothing is compiled, but it knows whether it is interpreted. It can see some parts of the... everything is interpreted. That's not fun.
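A sketch of what the Python tracing shown here could look like, assuming the instrumented build exposes function__entry/function__return marks with the argument order mentioned above; the python binary path is an assumption:

    # $arg1 = file name, $arg2 = function name, $arg3 = source line
    stap -e 'probe process("/usr/bin/python").mark("function__entry") {
                 printf("%s -> %s (%s:%d)\n", thread_indent(1),
                        user_string($arg2), user_string($arg1), $arg3)
             }
             probe process("/usr/bin/python").mark("function__return") {
                 thread_indent(-1)
             }'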
Otherwise, it could see some of the compiler work, the actual HotSpot methods, like at startup, where you can see the setup, until you actually hit your Java main program. Okay. So, let's take some questions. Wait, I must at least give you these two URLs. My examples were a bit simple because I didn't have that much time to explain everything, but there's a really nice examples page, and there is a beginner's guide which explains a lot about the examples. And there's also a wiki where you can learn how to precisely put user-space probes in your application.

Q: Can you tell us a bit more about the performance impact of running SystemTap on kernel calls, system calls? A: We try to keep the overhead really low. I'm not sure what our thresholds precisely are, but every probe is executed in place, so there are no context switches. In principle, all the logging is just dumped in a buffer, which is read later. So the impact is pretty small for most things.

Q: Is there the ability to define a priority in terms of the allocation of resources between the monitoring activity and the production activity? If you take the example of the SNMP protocol, SNMP on routers and things like that is always defined as the lowest priority. So is there the ability to define a priority between the real execution of things on the system and the monitoring activity? A: I haven't really thought about it. SNMP is a higher-level thing; this is much more low-level.

Q: Any thoughts on using SystemTap to determine memory usage of kernel modules? So: what is this kernel module using, why is it chewing up that much memory? A: You can partly do that. There are actually tracepoints on kmalloc, for example, so you could trace that, look at the caller of kmalloc, and map it to a module. There might be something like that in the examples; probably not precisely for modules, but there is something for how much kmalloc is called from where. (See the sketch after this answer.)

Q: How much of the debug info does SystemTap understand? For example, can it print a struct? A: Yes. SystemTap sees the debug info for the struct, and you can actually dereference it and get the fields of the struct.

Q: Maybe it's a little bit of a silly question, but would it be possible to implement this for Windows? Because Windows has, of course, a lot of tracing possibilities, and it's always nice to have similar interfaces. A: I don't know. Let's say it has been 10 years since I used Windows, so I don't know. Sorry.

Q: Is there a library with a programmatic interface to it? So, for example, could Python look at the profiling information of its own probes and expose that within Python? A: No. Q: Is that being considered, or is that interesting? A: Well, maybe something like that could be done. I believe Ruby has something like that. Q: Has there been any discussion of adding anything like that? A: No, not yet. To be honest, the Python support was only just done.

Q: Is there any specific support for monitoring multiple systems and looking at relationships between what different machines are doing and how they're causing faults on each other? A: Sorry, I didn't completely understand. You mean monitoring other systems than the currently running one? Q: Yeah, or monitoring distributed systems running on multiple machines. A: Yeah.
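As a sketch of that kmalloc answer, assuming the kernel's kmalloc tracepoint exports call_site and bytes_req (argument names vary across kernel versions):

    global allocs
    # tally kmalloc requests by the function that called kmalloc
    probe kernel.trace("kmalloc") {
        allocs[symname($call_site)] <<< $bytes_req
    }
    probe end {
        foreach (site in allocs)
            printf("%s: %d calls, %d bytes requested\n",
                   site, @count(allocs[site]), @sum(allocs[site]))
    }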
A: We have a client-server setup, so you could kind of spawn scripts on other machines, but that's not very much worked out yet.