Well, let's talk about strace, shall we? First, some audience participation. Who knows what strace is? OK, almost everyone. And who uses it regularly? OK, almost half. And who has had issues with its performance? OK, so yeah, there are still some people, probably the ones genuinely interested in this talk.

Anyway, strace, as you all know, is a syscall tracer. This talk focuses on one particular bit in its man page: that a traced process runs slowly. It might be considered a bug, but in fact it's part of the way strace works. Specifically, strace uses the ptrace debugging subsystem for tracing. ptrace is a generic debugging interface which provides a set of requests for manipulating tracees: reading and writing the tracee's memory, reading and writing its registers, et cetera. Almost all of these operations are performed on stopped processes. Another, somewhat peculiar, part of the ptrace API is that it reuses (abuses, even) the standard Unix signal interface, specifically waitpid, to deliver notifications about tracee events. This is used for all kinds of events: syscall stops issued after PTRACE_CONT, PTRACE_SINGLESTEP, or PTRACE_SYSCALL requests, signals that the tracee has received, and so on.

And what strace does is, well, it just sits in a loop waiting in waitpid, and upon receiving a signal from a child it figures out which child it was and what event happened. Then it reads information from the tracee: usually its registers, but sometimes, depending on the architecture, also memory. From that it can tell what the syscall is and what its arguments are. Then, based on the syscall number, the specific decoder is executed, which performs additional memory reads in order to properly decode the syscall and present it to the user. When the decoding is finished, the tracee is resumed. These stops and resumes happen twice for each syscall, once on entry and once on exit, which is kind of unfortunate.

So, assuming that little can be done about the way strace works as such, we can at least try to optimize the way it does what it does. And there is some room for optimization. For example, depending on the architecture, there are different ways to obtain the register data. Historically, it was the PTRACE_PEEKUSER request, which reads word by word from specially designated addresses representing the tracee's registers. Later, for architectures that appeared in the Linux tree more recently, new interfaces were implemented, and these were then backported to older architectures such as x86. By enabling these interfaces, we could speed up strace's operation to some extent. But that happened quite a while ago: for x86, the PTRACE_GETREGSET enablement was done in 2013 by Denys Vlasenko, and instead of issuing a multitude of ptrace operations, register fetching was reduced to just one. It is not all that rosy on some other architectures, though: on 32-bit x86 you still had to read memory, same on MIPS, and on IA-64 (Itanium) you still have to read memory to obtain the registers, because there they are stored in the register backing store, which is basically a region of memory. But on most architectures, it is down to a single call for obtaining the registers on syscall entry.
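To make that loop concrete, here is a minimal sketch (not strace's actual code; x86_64 Linux and glibc assumed) of a tracer that stops the child at every syscall entry and exit, and fetches all registers with a single PTRACE_GETREGSET call instead of a series of PTRACE_PEEKUSER reads:

```c
/* Minimal ptrace loop sketch: prints the syscall number at each stop.
 * Note it fires twice per syscall, on entry and on exit, which is exactly
 * the double stop-and-resume described above. */
#include <elf.h>        /* NT_PRSTATUS */
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/uio.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);    /* let the parent trace us */
        raise(SIGSTOP);                     /* wait until the parent is ready */
        execvp(argv[1], argv + 1);
        _exit(1);
    }

    int status;
    waitpid(pid, &status, 0);               /* the initial SIGSTOP */
    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACESYSGOOD);

    for (;;) {
        ptrace(PTRACE_SYSCALL, pid, 0, 0);  /* resume until the next syscall stop */
        if (waitpid(pid, &status, 0) < 0 || WIFEXITED(status))
            break;

        struct user_regs_struct regs;
        struct iovec iov = { &regs, sizeof(regs) };
        /* One call fetches the whole register set. */
        ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov);
        fprintf(stderr, "syscall %lld\n", (long long) regs.orig_rax);
    }
    return 0;
}
```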
The next thing is just, well, not issuing ptrace requests that are not needed. For example, you don't need to fetch the syscall number on syscall exit, assuming you are doing your tracee state tracking correctly. That is another problem in itself, one that was solved by the PTRACE_GET_SYSCALL_INFO request, since it also reports whether the tracee is at syscall entry or exit. Without it, you can't distinguish those two states reliably, and that has been a source of several bugs throughout history. There was, for example, the assumption that the first ptrace stop after execve is a syscall-exit stop; that is not always the case, and not on all architectures. We have hit such bugs several times, and they couldn't be solved reliably without implementing PTRACE_GET_SYSCALL_INFO. Same goes for filtered syscalls: when we know we are not interested in a specific syscall, we don't need to issue any ptrace requests on its exit at all. We just resume the tracee, and that's it. Then there are various optimizations specific to particular ptrace modes: for example, as a leftover from the GETREGSET enablement, a separate PTRACE_PEEKUSER request for the EIP register had been left in, and it was later patched out.

So, once we have obtained the registers, we need to actually read data from the tracee. Again, historically this was done through the PTRACE_PEEKDATA interface, which reads the tracee's memory word by word; that is not very quick if you want to read some large structure or an array. But in Linux 3.2, a new pair of syscalls was implemented: process_vm_readv and process_vm_writev. They were originally designed for things like MPI-based (Message Passing Interface) applications: when you have several processes running on the same node, you can avoid double-copying memory through shared memory or a pipe by using these syscalls to copy messages directly between processes. But they also turned out to be useful for strace, since a single syscall can read all the data it needs, at least for a given data region. It was still a bit cumbersome, as there were separate read calls for each array item and so on. So, six years later, Dmitry implemented caching for these calls, fetching a whole page with a single process_vm_readv call, which sped up array parsing considerably.

Another thing that was done is some general optimization. For example, it was usually assumed that you don't have a lot of processes to trace, so a simple linear algorithm was used for matching a PID to its tracee control block, the structure describing a tracee. But when you do have a lot of processes, as some users did, that starts eating a significant amount of CPU time, and trivial hashing (using the lowest 10 bits of the PID) sped it up quite significantly: in an artificial example, a five-fold speedup.

But not all the fixes are like that. There was a bug, reported by Red Hat partners, that strace with the follow-forks option would flat-out stall when run on multi-threaded processes. The reproducer is quite simple: a main process spawns 10 threads, and each thread starts issuing a cheap syscall like gettid in a loop. The problem is that after spawning three or four threads, it just stops doing so.
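Here is a rough sketch of that kind of memory reading. This is a hypothetical helper of my own, not the actual strace code: a single process_vm_readv call for the whole region, with a word-by-word PTRACE_PEEKDATA fallback for pre-3.2 kernels.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Copy `len` bytes at `remote_addr` in process `pid` into `buf`. */
static ssize_t read_tracee_mem(pid_t pid, const void *remote_addr,
                               void *buf, size_t len)
{
    struct iovec local  = { buf, len };
    struct iovec remote = { (void *) remote_addr, len };

    /* One syscall for the whole region, however large. */
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    if (n >= 0 || errno != ENOSYS)
        return n;

    /* Fallback: one PTRACE_PEEKDATA round trip per word on old kernels. */
    for (size_t off = 0; off < len; off += sizeof(long)) {
        errno = 0;
        long word = ptrace(PTRACE_PEEKDATA, pid,
                           (char *) remote_addr + off, 0);
        if (word == -1 && errno)
            return -1;
        size_t chunk = len - off < sizeof(long) ? len - off : sizeof(long);
        memcpy((char *) buf + off, &word, chunk);
    }
    return len;
}
```

The PID-to-tracee-block hashing mentioned above is even simpler in spirit: something like a bucket lookup of the form tcb_table[pid & 0x3ff], with the 1024 buckets given by the lowest 10 bits of the PID.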
And it virtually hangs. As it turned out, it was a scheduling fairness issue. You could argue that it's a kernel bug and strace shouldn't have to care, but on the other hand, it's strace and strace users who are impacted by it. So a solution was proposed: collect all the tracee stops in a batch and then dispatch them. It was implemented within a month, more or less by the end of January 2009. But there were some disagreements with strace's maintainer at the time about the specifics of the patch and the way it was implemented, so after initial inclusion it was reverted half a year later, and the fix itself remained RHEL-only. And it stayed like that for almost 10 years: there was one follow-up discussion three years later, when the time came to forward-port the patch to yet another RHEL, but the status quo held. The patch was simply carried along in RHEL, and that was it. When I became a maintainer of strace, I faced the need to forward-port it yet again and decided that I didn't want to do that more than once, so I tried to upstream it one more time. That took more than half a year and uncovered several bugs in strace, more or less corner cases, and we added three or four new test cases for them. It was finally included in strace 5.0, released back in March 2019. The patch carries quite a number of Reviewed-by and Co-authored-by lines, and I noticed that I had still missed at least several people when I wrote the commit message for it.

But all of this doesn't help much with the general slowness of strace: the removal of redundant ptrace requests and the use of process_vm_readv speed strace up only slightly, by tens of percent. As some of you may know, there is a famous (or infamous) post by Brendan Gregg about how strace slows things down hundreds of times, and here I am citing it and making it even more popular. Actually, I couldn't reproduce his exact numbers, but the issue is the same: if you have a process that mostly issues cheap syscalls, strace becomes the major bottleneck and can slow that process down more than 100 times, at least. In the real world you still get a slowdown, just not as dramatic, but still. And you can see in this example that all the improvements to strace over the past 10 years gain only tens of percent of performance.

So if we can't speed strace up any further, we can try to approach the problem from another angle. And that angle was, in a way, provided by security researchers, who uncovered several side-channel attacks whose mitigations slowed down the syscall entry code significantly: with all the kernel patches and microcode mitigations, it's now some three to four times slower. There used to be quite a finely crafted syscall entry routine on x86, and it was thrown out in 2018 by Andy Lutomirski, so that is no longer the case either. One aspect of that finely crafted routine was its fast path: if you were not, for example, being ptraced, your syscalls executed much faster, so being ptraced by itself slowed a process down disproportionately. With all this fancy side-channel mitigation stuff, being ptraced now costs only a couple of times of extra slowdown, which is actually in line with what other architectures have, for example MIPS, ARM, Power, or s390.
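Stepping back to that batching fix for a moment, here is a minimal sketch of the idea. This is my own illustration, not the actual patch: drain every tracee that has already stopped before dispatching any of them, so that one busy thread cannot monopolize the tracer.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>

struct stop { pid_t pid; int status; };

/* Collect up to `max` pending tracee stops into `batch`; returns the count. */
static int collect_stops(struct stop *batch, int max)
{
    /* Block for the first event... */
    batch[0].pid = waitpid(-1, &batch[0].status, __WALL);
    if (batch[0].pid <= 0)
        return 0;

    /* ...then grab everything else that is already pending, without
     * blocking, so the dispatch order is decided by the tracer rather
     * than by the kernel scheduler. */
    int n = 1;
    while (n < max) {
        pid_t pid = waitpid(-1, &batch[n].status, __WALL | WNOHANG);
        if (pid <= 0)
            break;
        batch[n++].pid = pid;
    }
    return n;   /* the caller dispatches all n stops before resuming anyone */
}
```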
But still, even then, it's a significant slowdown, hundreds of times in the worst case, and all because we have to stop the tracee every time, for every syscall. Well, we don't actually have to. There is another piece of infrastructure in the Linux kernel called seccomp, for secure computing, which allows installing syscall filters on processes. Originally it was implemented for process sandboxing, but then it was extended, and what is interesting for strace is that in Linux 3.5, seccomp programs gained the ability to hand a syscall over to a ptrace-based tracer by issuing a specific return value, SECCOMP_RET_TRACE. Some time later (it was actually the result of two Google Summer of Code projects), support for generating such filters for tracees was implemented in strace, and it was included in strace 5.3. So now, with this --seccomp-bpf option, in the same artificial example you slow the tracee down by only a couple of percent, not a couple of hundred times, which is probably a significant improvement. With more or less realistic examples (here I was tracing only memory-related syscalls) I got only about a 1.5-times slowdown instead of two to three times.

So that's, roughly, the history so far. There are not a lot of plans for the future: in the near future, probably some refinement of the existing capabilities; in the more distant future, some drastic changes are possible, but they mostly depend on specific kernel interfaces being added first. Only then can anything be implemented in strace. So that's probably it. Any questions?

Before we jump to questions, I want to tell you that we have a mic for questions, so just raise your hand and I'll come to you.

So, hey, I'm wondering: why is the seccomp-BPF backend not the default yet? What are the problems?

Well, first, we only enabled it in release 5.3, which was released less than half a year ago, and I personally don't feel confident enough yet to enable it by default. For example, when you impose a seccomp filter on a program, the processes down the fork chain can't always set up their own seccomp filters, and so on. We have in fact uncovered several minor bugs in it, which were fixed in the later 5.4 release. So we probably just want to wait a bit before enabling it by default, like it happened with some features before (though those are a bad example, because in one case it took almost six years to turn something on by default). The point is that we don't want to regress people: when we enable something by default, we want to be sure it won't break things for a lot of users.

I have an alternative answer for this. seccomp-BPF has two limitations. The first one is that you can't use it when attaching to already existing processes. So with strace -p you can't do this unless you are privileged, and strace is normally not privileged. If you are root, you can do all that kind of fancy kernel tracing anyway; when you are just a regular user, you use strace, and you can't attach a BPF filter to an already running process. That's the first issue. The second issue is that once you have attached a filter, you can't detach it. So if you, for example, stop tracing and want the tracee to go on, you can't just remove the filter you installed; instead of trapping, it will simply abort all the syscall invocations that were supposed to be forwarded to the tracer. And that's really unusable.
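For reference, here is a minimal sketch of the kind of filter involved. It is an illustration, not the filter strace actually generates (which covers the whole syscall set); the choice of openat is arbitrary, and a real filter must also check seccomp_data.arch.

```c
#define _GNU_SOURCE
#include <stddef.h>          /* offsetof */
#include <sys/prctl.h>
#include <sys/syscall.h>     /* __NR_openat */
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Let most syscalls run at full speed; raise a ptrace stop (SECCOMP_RET_TRACE,
 * delivered to a tracer that set PTRACE_O_TRACESECCOMP) only for openat. */
static struct sock_filter filter[] = {
    /* load the syscall number (the arch check is omitted for brevity) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* openat? hand it over to the tracer */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
    /* everything else proceeds without stopping the tracee */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installed by the tracee on itself, e.g. between PTRACE_TRACEME and execve. */
int install_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))  /* needed for unprivileged use */
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```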
OK, one last question.

So, assuming you have perf and an unlimited buffer, or just a way to stop the process when your buffer fills up, would that be the ultimate solution? Or are there any benefits to the current approach?

Yeah, that is actually what was proposed by Steven Rostedt in a patch: pausing the tracee when the buffer fills. But there is another issue: porting all the decoders to BPF. perf gets its rich syscall decoding from BPF programs attached to tracepoints; they retrieve the specific parts of the tracee's memory that the decoders need, put those parts into a buffer, and user space then pulls them out of the buffer in order to print them to the user. And that doesn't map well onto the decoders currently implemented in strace. In theory, if an approach like this were adopted and no other significant issues turned up, it is quite possible; but it's a far-distant future, because BPF programs are quite limited as they are now, and not all decoders can be ported that easily. For example, ioctl: you can't really port the ioctl decoder to BPF. Or some of the network syscalls, where we want to decode netlink messages. So yeah, it's a limitation. But then, it is the same as with seccomp-BPF: we can implement it for most of the syscalls and then provide special handling for the rest. So, thank you for your questions.
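For the curious, here is a very rough sketch of that perf-style approach, assuming a recent kernel, libbpf, and a bpftool-generated vmlinux.h. The map name, the hard-coded x86_64 openat number, and the fixed 256-byte record are all illustrative choices, not anything from the talk.

```c
/* BPF side only: copy the openat() path out of the traced process's memory
 * at syscall entry and ship it to user space through a ring buffer, where
 * a user-space decoder would format and print it. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);       /* 1 MiB of buffered decoder data */
} events SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int grab_openat_path(struct trace_event_raw_sys_enter *ctx)
{
    if (ctx->id != 257)                 /* __NR_openat on x86_64 */
        return 0;

    char buf[256] = {};
    /* Read the path argument directly from the traced process's memory. */
    if (bpf_probe_read_user_str(buf, sizeof(buf), (void *) ctx->args[1]) <= 0)
        return 0;
    bpf_ringbuf_output(&events, buf, sizeof(buf), 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```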