That's the exercise for the end of the talk. So today I'm going to talk about the flight recorder for C and C++ programs. The idea is to have instrumentation in your programs that is always there, and therefore is there when you need it. So what is a flight recorder? Those of you who had a program that crashed, please raise your hand. Those of you who knew why it was crashing? Okay. So now you have to think: what was I doing right before that? Why is it crashing now? Some idiot hit any key at the wrong time. I love this quote: "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to produce bigger and better idiots. So far, the universe is winning." So when you have a crash like this, you wish you could reproduce it, but of course when it's really tricky, usually you can't. These kinds of bugs are called heisenbugs, because those are the bugs that only happen when you don't look at them. The other thing is that sometimes, in order to reproduce the problem, you need someone to actually do the work for you. So you need some kind of person — you know, they're usually called users or something like that — and you need them to do something for you. And this picture shows you how programmers tend to see their users. Now, another really nasty effect is when the problem seems to go away by itself. You know, it's magic: just don't look at it, it doesn't happen anymore, close the bug. If your Bugzilla entry was not touched in two years, you can close it safely. It's a rule. And this piles up into, you know, "Unsolved Mysteries, part 177" that nobody ever gets around to. The problem there is that you're making someone unhappy. Do you want to make your customer unhappy? No, of course not. So you need a way to address that. So what does a flight recorder bring to the table? The idea is to collect information before the problem happens.
So you record everything while the program is running, and you print it only when you need it. A good example: you get a signal — your program crashes with a signal 11 or something — and then you print at that time. But you print the stuff that happened before the crash. So that means you record continuously. In order to record continuously, you need to have instrumentation that is always there. It's not something that you switch off, not something that you toggle, not something you #ifdef in. It's always there. And something that is always there has a magic property: it's there when you need it. Now, once you have this instrumentation in place, you can do more, and that's the nice trick about it. For example, you can do tracing using the same instrumentation. You can activate specific traces and say, "I want to see that happen." Instead of having all the printfs of your program come out at once, you get only those that you're interested in. And you can do continuous measurements and graphing; I'm going to demo that later. The bottom line is that you have one instrumentation to put in place, and multiple uses. Okay, so what is this thing? How does it work? Basically, it's simple circular buffers, and you record data in these circular buffers continuously. The new thing, for me at least, is that it has multiple buffers that can record at different speeds — new at least compared to what I knew from a previous life. The benefit of doing it that way is that you can preserve old but important records. For instance, at the beginning of your program, you can record printf-like statements for all the input arguments of your program: command-line parameters, options that you get from environment variables, all that kind of stuff. Because that buffer never wraps around, it stays there, and when you do have a crash, you get it at the beginning of the dump.
At the same time, you keep the most recent events. For instance, say you have an interrupt every microsecond: you can still get all the data about the last 128 interrupts, at the same time as the stuff that happened at the beginning of the program, because they are in separate buffers, so they don't overwrite each other. Okay, how do you use it? It's fairly easy to install. You just clone it as a submodule. At some point I'm going to make a package, like I've been promising myself for the best part of the past year, but for now you just clone the submodule and add two files to your source files: recorder.c, which is the recorder API itself, and a lockless ring buffer implementation. And it's fairly easy to use. It works mostly like printf, except you use a record statement instead of printf — most of you should be able to manage the difference. You put the name of the recorder you're storing things into as the first argument, then a standard printf format. No \n at the end — just because I'm lazy and I always forget them, so you don't need to put them. That makes it fairly easy to port existing code. For instance, this is the Spice code, the remote viewer for virtual machines. In that Spice code there was a spice_info macro. The only thing I needed to do was replace whatever log function was there, like a printf or something, with a record statement, and it connects to this mechanism automatically. So it's fairly easy to use. Okay, there is one caveat, though: the formatting happens late. I don't want to pay the price of formatting up front, so the data is recorded but not formatted at the time you record it. In other words, when you do a record statement like this, you end up with an entry — some structure — that has a timestamp, source file information, an order (which is the order in which things happened, so that you can sort events that came from multiple CPUs), the format string, and then up to four arguments per entry.
So you can see, for instance, that a %s string will be recorded as a pointer, as in C, and so it's not necessarily safe to print by the time you dump. Only the pointer is stored; the string is not copied. Since the formatting happens at dump time, this means that if something bad happened in between, the dump function will only print the pointer — the value of str will show as some pointer — because it's not safe at that point to dereference it. Now, in some cases you know that something is a constant string. In this case, for instance, I know that it's a built-in constant, so it's safe to print. For that case there is a special format: you add a plus to the %s, and that tells the recorder that it can safely print the string. Caveat number two is that the recorder, as it is written today, has macros for up to 12 arguments. There are four arguments per record entry, and you can have up to three entries. It's not a hard limitation; it's just that the macros to do that look like this, and they are somewhat complicated to write. You're welcome to submit a patch to go up to 16 if you need it. Just be ready to write this kind of crap. So how do you define or declare a recorder? The idea is that since this is instrumentation that is always in place, it had better be self-documenting, so that folks who use the program have a way to know what each recorder represents. So you have this RECORDER macro that defines a recorder. It gives it a name and a size. The name is in its own namespace, so you don't risk clashing with existing variable names, and you can keep the names short. The size is the number of entries you want to keep; it had better be a power of two for performance. Not a big deal, but it's better that way. And the last argument is a description that shows up if you pass the help option to the recorder, saying, "please show me what recorders you have in this program."
So that's how it shows up in the output. The RECORDER_DECLARE macro is used to declare a recorder in a header, much like a function: you have the prototype so that you can use it, and you have the actual implementation somewhere in one C file. So use RECORDER_DECLARE in header files, and RECORDER or RECORDER_DEFINE in the implementation files. Okay, so how do you dump the recorder? You use the recorder dump function, and what it does is basically sort all the entries and give them to you in the order in which they were recorded. Again, this is thread-safe, so you can have multiple CPUs banging on this thing at the same time. There is a variant that lets you select which recorders you want to dump. You may have a condition where this is a specific kind of error, and you know that when this happens, what you're interested in is this or that data; then you can select it — you call this variant of the recorder dump and pass a pattern for the things you are interested in seeing. There are callbacks to format the output of the recorder dump any way you want. Typically you would want to send it, for instance, to a standard logger, to standard output, whatever — you decide whether you want it to go to a file or anything else. And there are callbacks to format the record entries themselves, because I prefer to have date and time, then file name, et cetera, and maybe you prefer it the other way around; your call. There is a function that is really helpful, which is recorder_dump_on_common_signals, and what this does is install a recorder dump for each of the standard crashing signals, like SIGSEGV (signal 11) and so on. The arguments are bitmasks of signals to enable and disable, and if you pass 0, 0, you get the default set. Okay, so how does tracing work? You have this environment variable called RECORDER_TRACES, which is read automatically if you call recorder_dump_on_common_signals.
So you execute your program saying RECORDER_TRACES= the name of the recorder — a regex of the names of the recorders you want to see — and you can separate multiple ones with colons or spaces. And as I said, you can use regexes. Now, RECORDER_TRACES is read automatically, but some applications may have specific requirements for how they emit their debug output. For instance, Spice has its own way of doing debugging, so in that case I added some special code that reads SPICE_TRACES and changes the formatting of the output so that it looks right for the Spice team. Now, record statements for all the active recorders — the recorders that you marked as "I want to trace this" — will print right away. In that case, of course, you get the strings correctly in all cases. And you can programmatically configure which ones are active using recorder_trace_set, which typically you would use, for instance, to enable all the recorders whose names end in warning, error, et cetera. You can have naming conventions in your recorders to make sure that anything called something_warning shows up right away. Now, another interesting thing when you debug programs that are running more or less in real time is that sometimes you want to tweak what happens in them. So the recorder also has the possibility to define what I call tweaks, which are basically variables that you can access from the outside, so that you can change the program's behavior. The demo is probably going to explain this much better than just talking about it, but here's an example of code. We have this code that depends on some sort of maximum queue length, and I want to emit a message when we reach a given queue length. I want this queue length to be configurable, so I define a tweak — or a trace — and I say, okay, when I hit more than 300, for instance, then show me more details. So you use the macro RECORDER_TRACE with a name, or RECORDER_TWEAK with a name.
They really evaluate to the same thing, and they let you read that value. On the command line you would use name=value, giving a numerical value, to set the value of a tweak. If you just give a name without a value, it sets the value to 1. And if you set it to 0, you can disable a trace. So for instance, if you have a warning that is really annoying you, you can say this_warning=0 and you don't see it anymore. I said tweak and trace are equivalent because, since I needed this flag anyway to decide whether I was tracing or not, I figured I might as well give it a full word and let you use it for whatever you want. That's why RECORDER_TWEAK and RECORDER_TRACE are aliases; it's really the same thing. Okay, so I said we can do some graphing in real time. The idea is that if your recorder entries are stable — if your prints are organized in columns, so the data in the first column always represents the same thing — then you can export them as they happen into shared memory, and you can have another program that picks them up and draws them. And now you can see what happens in your program in real time. We'll see that in a moment. Let's say, for instance, that I have a size_stats recorder somewhere. Then I can have different record statements: initial size is this, the length is that; second size is this, the length is that. What matters here is that both have a %f as the first argument and a %u as the second argument — they have the same types. That's the first thing that is important. The second thing is that they have to represent the same data, otherwise your graph is going to jump all over the place. Makes sense so far? So in order to be able to graph something like this, you need to export it first. It's much like tracing, except it's tracing in order to see it in real-time graphs. And in that case, you give the columns names.
You say the name of the recorder equals the names of the columns. That way you can give arbitrary names in a given session to see what happens inside your code. Okay, so time for a demo that will clarify a lot of this. I have this little test application, and in it there is a recorder called speed_test that is defined. It has two tweaks, called sleep time and sleep time delta. And it has this thread here that loops forever, and on each iteration it records, as you can see, four values: the thread ID, the number of recordings made in this thread, that number modulo 500, and the time it took between two record statements. So let me start this speed test — say, with eight threads and for 10 seconds. The records are happening, but you see nothing outside. Okay, 10 seconds was a bit long. You can see that each of the records takes 43 nanoseconds on this system. And when it's done, it selectively dumps, for instance, some stuff that happened at the beginning. So if you look at the recorder dump here, you can see stuff that happened very early on — this is in record order. This happened at the very beginning, before the information from main: it happened pre-main, basically in constructors. Then there's the stuff that happens in main — you see that it launched eight threads. And then you have an example where you don't see the string showing up. And then you can see that the order jumped from 33 to 1027, so you know there's a lot of data missing here — presumably stuff that was recorded but then discarded, because the circular buffer was not large enough. And you see that not much time passed: it's basically a few microseconds between the two. Then you move forward still a little bit. Now you move forward 10 seconds, and the order is something like 227 million.
So basically, in 10 seconds it recorded 227 million records, most of which were discarded. And then, when you decide to actually dump — and selectively dump some of it — you get the latest. You see that, for instance, thread four was recording: its recording number was that much, the modulo was 64, it took zero microseconds compared to the previous one, et cetera. So this is the stuff that just happened. Okay. Now let me do the same thing, but with RECORDER_TRACES=speed_test. What I'm doing now is activating the tracing for the speed_test entry, which is the main one. And now, as it goes — okay, now you see one of the main benefits of this approach. Now I'm printing as I go, and you can see that my prints take around 75 microseconds. So I went from zero microseconds to 75. When you actually print stuff, it's costly. It's expensive. By not doing it ahead of time, you save that time. For debugging, printing might be okay, but for performance, you don't want to do that. You don't want to log to a file. It's expensive. Okay. Next thing: the speed_test recorder, as you saw, has four columns. So now I'm going to say that the first is the TID; the second is the recording number, which I don't care about; then there is the modulo; and then there is the delay. Those are just arbitrary names that I give to my four columns. And again, nothing shows up; execution again happens at full speed. But let me just put 1000 seconds so that I have time to do the next thing. Now I can start recorder_scope here, and I'm going to say that I want to watch the delay. Okay. So this is the actual delay that is happening. I said it was running at full speed, and you can see it's not completely full speed, because there are some outliers, and that's when the program is paused. What you see there is the effect of scheduling: most of the time it's close to zero, and sometimes it's much higher.
So now I'm going to actually force it to delay. I'm going to use the -c option of recorder_scope to set the tweak that I defined here, the sleep time, and I'm going to set that tweak to 15 milliseconds, for instance. And as if by magic — what's happening is that this code now executes, reads the tweak, sees that there is a sleep time, and does an extra sleep of 15 milliseconds every time. Okay, that's a bit slow. Let me adjust that. Oh, yeah, I thought that was too slow — that's because I used 150,000. Now it's 15 milliseconds, and you can see the actual scheduling time. So you're observing your scheduler at work: you asked for a delay of 15 milliseconds, and what you actually see is between 15 and 18-something. So now, as an introduction to the next talk, about the build system, I'm going to run a parallel build at the same time, and we're going to see in real time the effect on the scheduling on my system. You see that you have a few more outliers now; we're putting some pressure on the system, so it's slightly less stable. But it's pretty good, right? Modern schedulers know how to handle that stuff. And when it starts building all the stuff in parallel, which is coming soon, you're going to see more pressure. And now you see the pressure, you see the spikes; sometimes your program is delayed. OK. As you can guess, I really love implementing stuff that has absolutely no purpose in life, right? So now I use -s. -c was for command: I can send basically arbitrary commands to my application, so if I send help, for instance, it's going to dump the help for the recorder — that's sending a command over shared memory. Now I use -s, sleep time, between 10,000 and 100,000. Can someone guess what I'm doing here? What does s stand for? No, you won't guess. Slider. So now I have this nice slider, and I can adjust the behavior of my C program in real time.
OK, let me put that in front, because normally that's the time when you applaud or something like that. If you don't, I'm disappointed. OK. So it's not completely useless, actually; it has many, many purposes. Now, this is macOS, not Linux, and the reason I'm doing the demo on macOS is for this. Yeah, I know, once again I'm ahead of time — I speak too quickly, as usual. So macOS and all the BSDs have this nice feature called Control-T. Who knows what Control-T does on BSD? It sends SIGINFO to the running program. So, Control-T — demo effect. Oh, I'm in the scope, sorry. Control-T: recorder dump. It sends SIGINFO, so now I have what happened right now. Your program is mid-run, Control-T, boom. I really wish I had that in Linux. I find this so useful that I tried to implement it in Linux, and it's really crazy to implement that stuff, so I gave up on it for now. But it's really useful. So if someone wants to implement Control-T in the kernel — terminal handler, ioctls and all that stuff — feel free. It's really useful. OK. Anything else you want me to try while I'm in demo mode? No question there? OK. Just one thing that we might see here. Let me get back there, just to show you that this thing can be fast. Oops. OK. So this can be fast. And in that case, if I do my Control-T, what I see is that — where is that? OK, there's a bug there. You see that it's slightly more expensive than before — about one or two microseconds — when it's actually exporting the data. Exporting data to shared memory is slightly more expensive, but it's still pretty comfortable. OK. So I was talking performance all that time; let me give you some actual numbers. This is on a PC with a Xeon E5 at 3.60 GHz, so it's pretty much a standard PC these days. You see that a record is basically in the same ballpark as a best-case malloc or a very short memcpy, like a half-K memcpy.
It turns out that record is actually spending half of its time in gettimeofday — just getting the timestamp takes half the time. So there's a record-fast variant: if you have a group of records, record-fast reuses the timestamp of the previous one, and it's twice as fast. Look at the figure here: it's now 59 nanoseconds. You saw 43 on my system here, so it's in the same ballpark. Again, for comparison: memcpy is in the same order of magnitude, and malloc is more expensive than that. The best-case malloc — when it picks something off the free list right away — is in the same order; if it has to walk the list and make allocations, the worst case is much, much worse. Malloc is a killer. Don't use malloc. And snprintf — you don't want to use that, right? It's so expensive. Even printing to /dev/null is expensive: even if you don't keep the data, it's expensive. And if you print to a file and try to keep it, God forbid, don't do that. On the MacBook — a different operating system, macOS — the recorder is slightly slower. It's actually faster than that now; it's below 50, I did some optimizations since I collected all this data. And you can see that macOS is much better than Linux at printing to /dev/null. It's a very fancy optimization — I don't know why they did it; they wasted their time optimizing for the case nobody cares about. Now, of course, some of you actually use things like the Raspberry Pi — slower machines. Even in that case, you see that you're well within the microsecond range for a first-generation Raspberry Pi at 700 MHz. The CEO of NVIDIA has a nice way of talking about this: "It's a cute machine." Okay, that's cute — 700 MHz. And if you take the latest-generation Raspberry Pi 2 — well, the latest I have at home —
So it's interesting, because you can see that printing to a log file really depends on the speed of your file system — in that case it's a slower SD card, so the printing is slower. But basically, it's in the same order. The bottom line of this is: you can use it. It's really inexpensive. Even in the tightest loop, it's hard to find a case where you would not want to have the instrumentation stay in place just because it would cost too much. And it doesn't take that much memory either. For each of the buffers, if you take a reasonable size, you spend maybe 4K of data space somewhere, and 4K these days is, you know, nothing. Don't count it. So, it's lock-free, for scalability. That's the measurement with one thread, same machines as before. Now, if you have 32 threads, you can see that basically the time is — oops, sorry. Up. Ah! All the arrow keys are in the same spot on this keyboard. So you can see that the time is almost the same for one, two, four, 32 threads. Now, this is actually lying; I'll explain shortly why. And you get more of the same numbers on the other machines. That's for the Raspberry Pi, and you can see that the Pi 2 is much better for multi-CPU operations. Okay, so I said I was lying, and the reason is that the cost per thread increases: this is completely memory-bound. I don't know if Russ is here — I don't know if you saw the talk from Russ this morning about NUMA performance. This is exactly the kind of stuff you don't want to do on NUMA: having multiple threads that bang on the same counter, the same circular buffers, et cetera, because you basically evict data from one cache line to another all the time. So what is really limiting there is the memory bandwidth. What you're seeing is how much time it takes to write the record entries, and that's basically it. That's all there is to it. With that, time for questions.
Please, questions. I beg you. Super. "I'm sure mine won't be a problem. All right, so I'll ask the irritating question: what's the license used?" I think I changed it to GPLv3 recently, because I was using GPLv2 and someone complained. And frankly, I could not care less. If you ask me — I'm the only copyright owner — you want something that is under a more permissive license? I'll do it for you. "Okay, I think this is very cool, and I've seen similar kinds of libraries that are helpful for programming that are under more permissive licenses." Yeah, BSD, or MIT, or — "Yeah, or even more permissive. Give me the most permissive you have, because I think there are places where that would be very cool." So what I'll do is: find me the two most permissive licenses, and I'll apply one separately to each file in there. Any other question? By the way, the next step for this is basically packaging. I've never really packaged for Fedora yet, so I'll use this to learn. The next thing after that is putting it in a shared library. I'm still struggling a little with the initialization with C++. The problem is that, in order for the recorder_trace_set stuff to work initially, you want it to happen at constructor time, and that means that any record statements you might have in constructors might not trace properly. They would be present later, but they would not trace properly. I'm still trying to figure out a way to solve that problem; that's the only issue I know of at the moment. Who wants to use it? Please raise your hands. Yeah, got it. Okay, thanks a lot.