Thank you for coming. My name is Dmitry Levin. I've been the maintainer of strace since about 2009, and today I'll talk about modern strace. What is modern strace? Where does traditional strace end and modern strace begin? Maybe traditional strace is something that is well known, but well known to whom? What if you know the whole of strace, does that mean there is no traditional strace for you? Probably not. Since this is getting too subjective, I decided to use a more universal definition of modern strace: the strace that I maintain. So let's assume that all strace features introduced since 2009 are modern, and everything that came earlier is traditional.

But before diving into modern features... how does it look? Let me see. Oh, it works, great. Before diving into modern features, I'd like to say a few words just to recall traditional strace a bit. All features of traditional strace can be split, more or less, into several groups. There are features that control strace's output in different ways: whether to collect and print instruction pointers, whether to collect and print timestamps and, if yes, how to print them; how to collect and print strings, I mean their size and format; how to control the verbosity and detail of syscall decoding and printing; whether to print signals; whether to dump data, because strace can dump the data read from or written to specified file descriptors. You can also redirect output to files and pipelines. There is also traditional syscall filtering. There are also means to collect and display statistics: time spent in system calls, number of calls, number of errors, how to sort and print this, and so on. And there are ways to control strace's behavior in various ways, like whether to create processes or attach to already existing processes, whether to follow child processes, and so on. So this is more or less traditional strace. I'm not going to talk about traditional features; they are more or less known to you, I think. I hope. Yes? Okay.

So modern features can also be split into several groups: ways to control output, for example printing information associated with descriptors, printing the stack of function calls, controlling how named constants and flags are printed. We have quite elaborate syscall parsers, and I will show at least one of them. There are more ways to filter syscalls: now you can filter syscalls by paths, use regular expressions, and there are also new syscall classes. There are more ways to collect and print statistics, and more ways to control strace's behavior; we'll talk about this a bit later. There are also more ways to collect and print more detailed information than strace does by default. And last but not least there is system call tampering, which is something completely different compared to everything we knew about strace.

Let's start with something more or less traditional, which is system call filtering. The example you can see was a real problem about a year ago. As you probably know, there used to be just one syscall called open in the Linux kernel. Then openat was added quite some time later, and then new architectures appeared that don't have the open syscall anymore.
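I'll come back to this open story in a moment, but just to make the traditional side concrete first, a typical old-style invocation looks roughly like this (a sketch from memory; ./app is a stand-in for whatever you trace, and the exact flags are in the man page):

    strace -f -tt -o /tmp/log -e trace=open,close -e read=3 -e write=3 ./app

Here -f follows children, -tt prints timestamps with microseconds, -o writes the output to a file, -e trace= is the traditional syscall filter, and -e read=/-e write= dump the data read from or written to descriptor 3.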
And then, the year before last, in glibc 2.26, they changed the implementation of open to invoke the openat syscall on all architectures, not just on the new ones but also on the traditional ones. What happened is that all scripts that used to filter the open syscall stopped working, because glibc no longer invokes the open syscall. So people changed their scripts to trace openat, and then some of them found out that the scripts no longer worked with older glibc versions. And if you deal with statically linked programs, the situation is completely unpredictable, because you can end up with an executable or a library that invokes both open and openat, depending on how these different implementations ended up in the executable.

The traditional way to deal with this is just to list the two system calls. But, as you probably remember, not all architectures know about the open syscall, so this traditional approach just doesn't work on those architectures. If you're writing a portable script, this is not the right way to go, because a simple listing is not portable. So what can be done here? Yes, we added support for regular expressions. It was the year before last. Can you imagine? Regular expressions have existed for many years and are so well supported in different software, and strace is also many years old, yet we got support for regular expressions only recently. It sounds strange, but it's a fact.

Using regular expressions you can solve this problem of listing syscalls: you can just write /open, and it will select all syscalls that contain the substring open. But, as usual in the kernel, because of differences between architectures and other peculiarities, you can end up with not just open and openat but, as you can see, other syscalls as well. People are lazy, they actually do this, and I also do this, but if you catch open_by_handle_at, for example, then it's probably bad luck and you really should use a more precise regular expression.

By the way, in this example you can see output from a tool called asinfo, which stands for advanced syscall information tool. It's a really nice thing; I don't think you can get this information any other way. It's part of the strace project, but it's not merged into the mainline, so it's not part of any release yet. It's going to be, I promise. Why isn't it in any release? Because the command line interface is not yet stable, and once it's released we'll have to provide compatibility for many years. Once the interface stabilizes, it will be in a release.

Why would you want to filter syscalls at all? Why not print everything and be happy about it? Well, sometimes you really don't know what you're looking for, and in that case you really want to print everything, maybe even a bit more than everything. In such cases you just enable all the extended tracing that strace can do; I'll talk about it a bit later. But when you know what you're looking for, for example when you have a script that is looking for something particular, like a filter of some kind, then you don't want unrelated output to get in your way and maybe confuse your script, because if it handles arbitrary strings, those strings can look like syscalls, who knows.
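Concretely, the two approaches look roughly like this (a sketch; the exact set of syscalls matched by the regular expression depends on the architecture and the strace version):

    strace -e trace=open,openat ./app    # explicit list, not portable across architectures
    strace -e trace=/open ./app          # regular expression, matches every syscall whose name contains "open"

A stricter pattern along the lines of /^open(at)?$ would avoid picking up things like open_by_handle_at.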
So the idea is that if you don't know what you're looking for, you print everything, but if you're looking for something particular, it's better to use filtering. And after all, strace works better when it traces only a subset of syscalls, and for strace it's important to work faster.

Besides regular expressions, there are the new system call classes we introduced. In addition to the traditional syscall classes listed here, we introduced a class for the stat family of syscalls. Why the stat family? This is just an example of one of these classes. The stat family is big, complicated, and quite different between architectures; these syscalls are old enough and numerous enough that you can trace the history of the Linux kernel through them, as you can see on the next slide. In this example, besides the new class, you can also see one of the methods of extended tracing, the -y option, which prints the path names associated with descriptors.

So, what about the stat family syscalls? It's quite complicated. The first syscalls in the Linux kernel, from the beginning, were stat, lstat and fstat. Then new editions of those syscalls were added, and stat, fstat and lstat were renamed with an old prefix. Then, on 32-bit architectures, 64-bit analogues of these syscalls were added. Then the at-variants of these syscalls were introduced. And finally, the year before last, I think, the statx syscall was added. It was added to replace all of them, but how long will it take to replace all these traditional and legacy syscalls? You can more or less assume that the old syscalls, the ones with the old prefix, are obsolete; you won't find them used anywhere in live projects besides tests, probably besides strace tests. But, for example, in glibc the wrapper for statx was added only in the last release, 2.28, last summer. Before that, all applications that wanted to use statx had to invoke it directly, as a raw syscall. And even when it is available in glibc, you can't assume that the syscall itself is available, because the kernel may not be fresh enough, so you have to handle the situation when it's not available.

What does this mean for strace and for strace users? That you can't assume that everything uses one particular stat family syscall. So if you want to trace all stat syscalls, you would have to list them, and you can't really list them, as you see: they all differ between architectures and even ABIs. The best way to handle this, if you want to filter these syscalls, is to use classes. You probably could write a regular expression to describe this, but it's much easier to use classes.

While talking about extended tracing I mentioned the -y option; if you give it twice, you get even more information associated with descriptors. In this example you can see how this information is extended as the descriptors get more and more connected. For example, when a socket has just been created, it's just a TCP socket, but once it's connected, strace can print more information: source and destination address and port, for example.

Another feature related to strace output is stack tracing. It's not really tracing, but you can ask strace to print the stack of user-space function calls at system call invocations.
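Before the stack-tracing example, here is roughly how those two pieces look on the command line (a sketch; the class is spelled %stat in recent strace versions, if I remember correctly, and -yy needs a reasonably new strace):

    strace -y -e trace=%stat ls                               # the stat family class, with paths shown next to descriptors
    strace -yy -e trace=%network curl http://example.org/     # sockets shown with addresses and ports once connected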
Back to stack tracing: in this example you can see that the cat utility, for some reason, closes stdout on exit. If you don't know why this happens, if you want to match the system call to the application logic, you can enable stack tracing. In this example you can see, judging by the names of the glibc functions, that this close is being called from an exit handler. Why does cat do this? It has to ensure that everything it has to write is actually written, and if it isn't, it has to exit with a non-zero exit code. In some cases, for example when writing to network descriptors, it can happen that the write syscall succeeds but the close doesn't, because the data isn't actually written, so it makes sense for cat to do this. But without stack tracing, all you can do is guess what's going on.

By the way, in this example, besides the -k option, you can also see the way to filter syscalls by the paths they access: strace shows only those syscalls that have something to do with the /dev/full device. This is useful when you really don't know what's going on, or, for example, when you think you know what's going on but the application is multi-threaded and quite complicated, and when something goes wrong you want to trace it back to the application level; in such situations this can help.

Another feature that controls strace output is the -X option. It's quite a recent addition. It allows you to print named constants in different ways. The traditional way is to print them symbolically when they actually match some symbolic constants. But sometimes what happens is that some software, probably because glibc doesn't implement wrappers for all syscalls, or for portability, because it implements them now but used to miss some, implements its own wrappers, and those are not always portable. What happens is that they sometimes confuse syscall arguments. That shouldn't sound surprising, because even the number of syscall arguments is not portable across architectures, and even the semantics can differ. So if you suspect that an application confuses syscall arguments, you can print both the raw values and the symbolic values and see the problem, because it makes little sense to print constants only in symbolic form if the application really confuses them. For example, in this openat call, the first argument and the third argument are both numeric, but they have different semantics. Another application: I was told last year that a tool from another project also uses strace -X raw output to convert trace logs from strace into programs in its own format.

Let's move on to statistics. In addition to the traditional statistics that show how much time is spent in syscalls, I mean system time, you can now also ask strace to report the real time spent in syscalls. When could that be useful? Because some system calls sleep a lot. Not just nanosleep, and don't be confused by the nano prefix, it sleeps a lot, sleeping is its whole purpose; other syscalls can sleep too, for example while waiting for input/output completion. In some cases the bottleneck is not the number of syscall invocations but the amount of time they spend sleeping. For example, it's not a rare case that delays were added to an application to work around some problem, and then a few days later either the problem disappeared or even the people disappeared. It happens. But those delays were left behind, forgotten.
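To make that concrete before I finish this story, the two kinds of statistics are just two summary flags (a sketch; -w has been available for a number of releases now):

    strace -c ./app       # per-syscall counts and system time spent in syscalls
    strace -c -w ./app    # the same summary, but measured as wall-clock time, so sleeping syscalls stand out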
And then, a few years later, you end up with a program that does something you can't explain. In such cases you can use this statistics gathering to uncover cases like that.

Yet another option controls strace's behavior. As you know, strace is not very transparent to the programs it traces. For example, when strace runs a program, it is not just its tracer but also its parent, so, as you can see in this example, the parent of the traced program is strace, not the program that invoked strace. Sometimes that's not desirable, and there is a way to make strace more transparent: with the -D option you can make strace go into the background almost immediately. As you can see in this example, the parent process of the tracee is then the process that invoked strace, not strace itself. When could that be useful? There are actually many situations where you want strace to be more transparent. For example, if you want to trace a whole container with all its processes, you just invoke init under strace -D; strace will be the PID 1 process for a very short time, then it will go into the background and replace itself, I mean PID 1, with the real init.

Besides options, we have, as I said, more elaborate parsers, and this is probably the most elaborate one: it handles the most popular netlink protocols. In this very small example you can see the routing table of an almost empty network namespace, where there is just a loopback device with a default route. And if you run this command under strace, you'll see this. You can probably guess the pattern in these four lines. You can go back, you can go forward. So yes, it's quite elaborate. And NETLINK_ROUTE is not even the most complicated of the netlink protocols; there is also netfilter, which is even more complicated, and strace can now decode that too.

There is also a small utility script, strace-log-merge, intended to aggregate trace logs. When could that be useful? Sometimes it's useful to keep the logs of different processes in separate files; this way you get rid of the "unfinished ... resumed" markers, and sometimes it's easier to handle such logs with scripts. But when you want to look at the result with your own eyes, when you're looking for some pattern and you don't know exactly what you're looking for, then it's useful to aggregate these traces back together. And the best way to aggregate them is to use the program that already does this properly. It's part of the strace project, so it's covered by tests. It's much better to use a program that's already tested than to write your own; you'll have plenty of opportunities to make your own mistakes, there is no need to repeat the mistakes made by others.

Okay, so last but not least is the new feature that's quite revolutionary for strace, because strace used to be a tracing tool, and now, with the addition of system call tampering, it's something completely different. I mean, it can still trace as it traced before, but it can also tamper with processes. The first method of tampering we added was syscall fault injection. Fault injection in general is a well-known tool for testers; it's used to simulate situations that are not easy to reproduce in reality. And in the case of strace, when you can inject faults into syscalls, it's really handy and really easy to do. This way strace became a testing tool. We actually use syscall fault injection to test some parts of strace itself. Why not? Some code paths are not easy to trigger, like error paths.
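For instance, to exercise an error path, something along these lines (a sketch; ./app is a stand-in for whatever you are testing, and the inject syntax is documented in the man page):

    strace -e trace=openat -e inject=openat:error=ENOENT:when=3 ./app

This makes the third openat invocation fail with ENOENT without the kernel ever executing it, so you can see how the application handles the failure.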
Fault injection also helps because some decoders only work when the process is privileged, and when you want to test without too many privileges, it's a good opportunity too. So, yes, this is just a simple example.

But besides testing, fault injection can be used to actively look for bugs. These are quite old examples, as you can see, Python 3.5 is history nowadays, but this was when the feature was being written; it was just a prototype of the feature, and the author of the prototype was looking for bugs in different projects. He found this pretty one: Python 3 used to ignore the return codes of file operations on the /dev/urandom device, and that led to the situation where it accessed, as you can see, the hexadecimal address 50, which is probably some method of an object mapped at the null address. Just by ignoring all these errors, it ended up with a crash. Fortunately the bug was fixed, in Python 3.6 I think. But it's funny.

Another bug found using this fault injection feature was a bug in the glibc dynamic linker. It's usually good at checking return codes, but in one spot it ignored the return code of the mprotect syscall, and the result was that a region of memory that was expected to be made inaccessible stayed accessible. So it could probably be considered a bug with security implications. Well, I found it, I fixed it, you don't have to think about it anymore. But this is how you can use fault injection to look for bugs.

After syscall fault injection, other kinds of injection were added, like return value injection, where what you inject is not necessarily an error. You probably don't want to inject return values into just any kind of syscall, because of how it works: in the case of fault injection, strace cancels the syscall itself and, on exiting the syscall, injects the error value; in the case of return value injection, it also cancels the syscall and injects whatever was specified. So if a syscall is expected to write something into user memory, it's probably not a good idea to inject a return value, because nothing is going to be written. But for those syscalls that don't write into user memory, that for example just return zero on success, you can do this.

This example is actually modeled on a real case. We had to debug an application that wrote into many temporary files, then passed these temporary files on to other applications, and then removed them. Some bug crept in, and some of these temporary files got content they shouldn't have gotten. The easiest way to analyze this was to cancel the removal of these temporary files. It's a really simple thing to do: the application was unaware of the trick we played on it, and we could analyze these temporary files almost for free. And the bug was fixed, fortunately.

And yet another kind of injection, probably the last one I'm going to talk about, is delay injection. Why would you want to inject delays? Isn't everything slowed down enough already? Well, the key word here, as with the other kinds of injection, is targeted: it's targeted delay injection that's useful, just like targeted fault injection and return value injection. Some programs, I mean their authors, seem to think that if they ask to sleep for a specified time, it will sleep exactly that amount of time. So they encode expectations that the Linux kernel does not guarantee to meet. What is guaranteed, in the case of sleep, is that if a syscall is asked to sleep for one second, it will sleep at least one second. That's true.
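Going back to that temporary-file case for a second, cancelling the removal looks roughly like this (a sketch; ./app stands for the application under test):

    strace -f -e trace=unlink,unlinkat -e inject=unlink,unlinkat:retval=0 ./app

The unlink and unlinkat syscalls are never executed, the application is told they succeeded, and the temporary files stay on disk for inspection.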
But in the modern world, with all these virtual machines and funny schedulers, you can end up with something like a second and a half. I've even seen three seconds in OBS, where, among other environments, we test strace. So you can't really expect that a specified amount of sleep in a syscall will take exactly that time. Some applications actually do expect that, and this is a way to replicate the situation reliably, so it can be used for regression testing, just like syscall fault injection.

Besides this, you can use it, for example, to slow down some specific operations, say some specific network operations. Imagine you have an application that can generate a lot of traffic but doesn't have its own means to limit it. Using filtering, you can target delays at the very specific syscalls that generate the output, and this way you can slow it down as much as you like. The example here is really artificial, because you don't normally have to slow down write operations to /dev/null; /dev/null can accept data as fast as you can write it. But it's a way to demonstrate the feature. As you can see in this example, the throughput measured by dd itself drops from about 5 gigabytes per second to roughly three orders of magnitude less than in the normal situation.

So this is more or less what I wanted to tell you about modern features today. As you can see, strace is a very old project, really old, probably a few months older than the Linux kernel, but, like the Linux kernel, it's still a very live project. We are still adding new features, and more features are coming. So now I'm ready for your questions and suggestions. Do I have to repeat the questions? I'll do my best, but if I forget to repeat them, please remind me. Yes?

So, on an earlier slide, you had a couple of them; could I go back to slide 25? Yeah, I can do this. In this one invocation of strace there is a slash unlink; what is the slash? It's a regular expression. Yes, it's the regular expression feature, because it could be unlink or unlinkat, and I don't want to worry about that anymore. So the question was about /unlink. And thank you for reminding me about repeating the questions; I'll answer them first and then sum them up.

So, the classic method is ptrace. The question was whether we still use the ptrace interface or whether we have BPF-based methods. So far we use ptrace. There was a GSoC project last year to use seccomp, which still works via ptrace, but you can save about half of the context switches, or even more, by avoiding stops for the syscalls you're not interested in. So it's still ptrace, but with seccomp you can save some time. With regard to BPF, it's probably our future, but not the nearest one. First, BPF is a root-only thing. Second, there are serious, very strict limitations on the size of the programs you can load, and imagine how much code it would take to produce output like this; probably you would have to fetch the raw data and then process it in user space. So it's not really under development yet, but we're still thinking about it. It's a way to go, but the road is unclear.

Any more questions? The question was whether you can combine -p with an injection. I think so; I bet you can.
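Something like this should work, I believe (a sketch; 1234 stands for the PID of an already running process, and the target syscall is just an example):

    strace -p 1234 -e inject=openat:error=EACCES

That attaches to the running process and makes its openat calls fail with EACCES; return value and delay injection should combine with -p in the same way.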
Any more questions? The question was, can we combine dumping of data with... sorry, with what? Could you repeat the question? The question, as I understood it, was whether you can combine dumping of data with syscall filtering of some kind. The answer is yes: you can specify the syscalls you want to filter and also dump the data associated with the descriptors you specify. So the answer is yes, if I got the question right.

Any more questions? We still have a few minutes, I think. Yes, please. The question is whether strace supports plugins that could be loaded to extend strace in some way. The answer is no, I don't think strace supports any plugins. There is no API for this; strace was not designed for this. If we were going to support plugins, we would have to design an API. Yes? Are you being recorded, or do I have to repeat the things you said?

With Lua, you can supply arbitrary code which is executed at each syscall stop, syscalls and everything, and that way you can load any additional code and have it run on every syscall. But the Lua support is a bit experimental; in short, it's not merged. What's the name of the project? Well, it's a branch available in strace's GitHub repository. Oh, I see, it's a branch. Yes.

I would like to add to this that writing plugins is a completely different level of complexity, and strace is expected to be easy to use, without writing any code. People writing new extensions are not really the target audience of strace. For example, in the case of Lua scripts, it's not an easy thing to write a Lua script compared to writing any command line you can see in these examples. Even the most complex one, fault injection, is still much, much easier than writing a Lua script.

Yes, please. Can you have multiple for the same... excuse me, could you say it a bit louder? Could you have multiple injections for the same call? So the question is whether you can do multiple fault injections on a single system call. The answer is probably no, because we don't insert new syscalls. What we do is substitute the system call with an invalid one and then inject an error on exit; we don't change the control flow. So to inject several faults, you would have to inject more system call invocations.

Sorry, do you mean within the same system call? I'm not sure, and we're almost out of time. Currently there is only one injection of each kind that can be configured per syscall, but you can configure different types of injection simultaneously; for example, you can do an error injection and a signal injection at the same time. Otherwise, yes, it's a limitation. And in the project that is not merged yet, the one with advanced syscall filtering, you basically can specify multiple injections for different subsets of syscall invocations.

So, thank you very much.