Hello everybody, thank you for coming. My name is Dmitry Levin, I'm the Chief Software Architect of BaseALT, and I have also been the maintainer of strace for more than 10 years now. I usually talk about modern features of strace, and today is no exception, although it is also a history: the history of a Linux kernel design flaw.
This history starts back in 2001, when the x86-64 architecture was first implemented. Actually, Linux was the first general-purpose kernel where support for this new architecture was added; others followed later. The main feature of this 64-bit architecture was that it could execute not only native 64-bit instructions but also legacy 32-bit instructions. This is probably the reason why it defeated its competitor. But the way it was implemented in the kernel allowed mixing not just instructions but also system calls: in the same 64-bit process you could invoke both native 64-bit system calls and legacy 32-bit system calls. This feature was not very widely known and not very well documented, and I witnessed many cases where people were genuinely surprised to find this out.
So what does the Linux kernel API provide to find out information about a system call? A user space tracer or debugger could obtain the system call number, the value of the CS register that describes the bitness of the process, and the address of the CPU instruction that was invoked, but there was no API to tell exactly which system call it was: 64-bit or 32-bit? So what could user space tracers and debuggers do when they needed to obtain the system call information? They fetched the system call number from one register, fetched the value of the CS register to find out the bitness of the process, and made a wild guess: if the process is 64-bit, then the system call must also be 64-bit, and if the process is 32-bit, then the system call should be 32-bit too. Why not? It's quite natural, right?
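The wild guess described above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the two selector values are the ones Linux uses on x86-64 for 64-bit and 32-bit user code segments.

```c
#include <stdint.h>

/* Code segment selectors Linux uses for user space on x86-64. */
#define USER_CS_64BIT 0x33  /* 64-bit code segment */
#define USER_CS_32BIT 0x23  /* 32-bit (compat) code segment */

/*
 * Hypothetical helper: guess the bitness of a system call from the
 * tracee's CS register. The CS value only describes the mode the
 * process was running in, so the result is a guess, not a fact.
 * Returns 64, 32, or 0 when the selector is not recognised.
 */
int guess_syscall_bitness(uint64_t cs)
{
    switch (cs) {
    case USER_CS_64BIT:
        return 64;
    case USER_CS_32BIT:
        return 32;
    default:
        return 0;
    }
}
```

The guess is right whenever the process and the system call have the same bitness, which is the common case, and wrong for a 64-bit process that uses the 32-bit entry point.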
And then, supposing they knew the bitness of the system call, they fetched the system call arguments using the method appropriate to this bitness, and they worked this way. Later, a slightly faster method of obtaining registers was introduced in the Linux kernel, so tracers started fetching the whole register set, but the register set sent by the kernel depends on the bitness of the process. So from the system call bitness perspective, nothing changed. And it is no surprise that it depends on the process bitness, because debuggers may need to fetch registers for all kinds of purposes, not just for system calls. Strictly speaking, there is no direct link between process bitness and system call bitness, except that very often they match. But sometimes they don't.
So what happens when they don't match? The first time known to me that this bug was reported, it was reported against strace in the Debian bug tracker back in 2008, with a very simple reproducer. Here you can see a reproducer that is essentially the same as the one you can find in the bug report. It just prints a string, then invokes a 32-bit system call using int 0x80, and then prints another string. This funny-looking system call is actually fork: if you remember the x86 ABI, the system call number is stored in the EAX register, here it is number two, and number two in the x86 system call table is fork. If you compile and run this program, there will be one line printed, and then two lines printed by both processes. I think the process IDs will be different, but all the rest will look like this. But if you run this program under strace, you will probably be surprised, because you will see something really odd. The first line is printed all right, then a process gets attached, and then you see that.
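The reproducer described above is not preserved in the transcript, so here is a minimal sketch of what such a program could look like. It is not the code from the bug report: the function names and messages are made up for illustration, and it assumes an x86-64 Linux kernel built with 32-bit emulation.

```c
#include <stdio.h>
#include <unistd.h>

/* fork is number 2 in the 32-bit x86 system call table. */
enum { I386_NR_FORK = 2 };

/*
 * Invoke the 32-bit fork system call from a 64-bit process via
 * "int $0x80". Returns the raw kernel result: the child PID in the
 * parent, 0 in the child, a negative errno value on failure, or -1
 * when not built for x86-64 Linux.
 */
long compat_fork(void)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("int $0x80"
                      : "=a" (ret)
                      : "0" ((long) I386_NR_FORK)
                      /* r8-r11 are listed because some kernels clobber them */
                      : "memory", "cc", "r8", "r9", "r10", "r11");
    return ret;
#else
    return -1;
#endif
}

/* One line before the fork, then one line from each process after it. */
void run_reproducer(void)
{
    printf("before fork, pid %d\n", (int) getpid());
    fflush(stdout);  /* avoid duplicating buffered output in the child */
    long rc = compat_fork();
    printf("after fork, pid %d, fork returned %ld\n", (int) getpid(), rc);
}
```

Calling `run_reproducer()` from `main` and running the binary under a tracer that guesses the bitness from the CS register should show the mismatch described in the talk.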
You see a very odd-looking open system call with ridiculous arguments and an impossible return code. The tracer assumed a 64-bit system call, and number two in the x86-64 system call table is open. All subsequent system calls look as usual, making the whole picture very odd. You can actually run this program several times, and every time you will likely see a different set of impossible combinations of open flags. For me, it resembles a toy I had in my childhood, a kaleidoscope: you turn it slightly, and every time you get a new nice or odd-looking picture. The reason for this is address randomization. All these strange flags are actually garbage in registers; this garbage depends on what is left from previous system calls, which in turn depends on randomized addresses and such. This was not the case at the time of the bug report, but nowadays you can use this as a nice kaleidoscope.
So what are the alternatives? What could user space tracers do? There is an alternative method of obtaining system call information. You know the instruction pointer, so you can step two bytes back and fetch from that address what is supposedly the instruction that was used to invoke the system call, and then decide from its opcode what the bitness of the system call was. What's the problem? When you fetch something from memory, there is an inherent race condition. And it is not just inherent: later, in 2012, Linus Torvalds showed a very short example, several lines long, of how to deceive the tracer in a reliable way. It's actually not a race if you have a 100 percent chance to deceive the tracer. This method also costs an extra system call invocation, but compared to the unreliable result, that is not really a big deal.
So what could we do? Actually, this problem had been known to kernel developers for quite a while.
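The "step two bytes back" method can be sketched as a small classifier over the two bytes that precede the instruction pointer. The names below are hypothetical; the opcodes are the standard x86 encodings of `syscall` and `int $0x80`. A tracer would obtain the bytes with something like PTRACE_PEEKTEXT at the instruction pointer minus two, which is exactly where the race comes from.

```c
#include <stdint.h>

/* Kinds of system call instructions this sketch recognises. */
enum syscall_insn {
    INSN_UNKNOWN,  /* not a recognised system call instruction */
    INSN_SYSCALL,  /* 0f 05: syscall, the native 64-bit entry */
    INSN_INT80     /* cd 80: int $0x80, the legacy 32-bit entry */
};

/*
 * Classify the two bytes fetched from (instruction pointer - 2).
 * The tracee can rewrite these bytes between making the system call
 * and the tracer reading them, so the answer cannot be trusted.
 */
enum syscall_insn classify_syscall_insn(const uint8_t insn[2])
{
    if (insn[0] == 0x0f && insn[1] == 0x05)
        return INSN_SYSCALL;
    if (insn[0] == 0xcd && insn[1] == 0x80)
        return INSN_INT80;
    return INSN_UNKNOWN;
}
```

Even without a malicious tracee this is only a heuristic, because it inspects memory as it is now rather than the instruction that was actually executed.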
In January 2012, there was a lively discussion on the kernel mailing list. It started with an RFC patch proposing a feature that later became known as seccomp-BPF. During that discussion, they found out that the proposed implementation didn't take these compat processes and the whole issue into account. And this nice person was a maintainer of a ptrace jailer, or ptrace sandboxer. They were quite popular in those days because there was no seccomp. So he was a maintainer of this thing, and he was very surprised, almost as surprised as this. He couldn't believe that this feature existed and was almost undocumented.
Several well-known people suggested various solutions for the problem. First was Linus, who suggested abusing eflags, because probably nobody uses the high bits: just use two high bits of eflags to encode whether the system call is compat or not. But then H. Peter Anvin said he didn't like this hack and suggested another hack: abuse the CS register, because its high bits are less likely to be used. So he thought his hack was nicer. Then Roland said that he didn't like that kind of hack and suggested using the regset mechanism, maybe just introducing a new regset flavor. This would be nice to already existing programs and so on. Then Linus asked, why not extend one of the already existing regsets? Roland responded that this would break compatibility with existing software: core files would change, and so on. Oleg suggested using a new flag and delivering new ptrace events. He suggested, of course, using ptrace, and it was nice of him to suggest this good name. So: introduce four new events. Another participant suggested not introducing new events but providing this information via PTRACE_GETEVENTMSG instead, so that you can make an extra system call and obtain this information.
But as you can imagine, the end result was that seccomp was finally implemented, because there were Googlers behind this feature: they wanted it, and they finally got it into the kernel. But nobody really wanted this feature in ptrace. The evolution of this problem went through all the classic steps. First, people said the problem doesn't exist, but you clearly saw it does. Then they agreed that it exists but said there are no consequences: no problem if a system call is printed wrongly. But then came people who relied on correct information about system calls: if you maintain a ptrace sandbox and it makes the wrong decision, you are out of the game. Then people said that the race was not practical, but Linus showed them that it is more than practical. Then people, quite well-known people I would say, suggested a lot of interesting ideas, and every idea was objected to by one or more kernel developers. So it was a nice discussion, very lively. You can actually find it and read it; it's an interesting read.
But nobody came up with a real patch. So nobody was really interested. No big clients came to big vendors to request the feature. Researchers were busy researching other kinds of stuff. So nothing changed; there was no follow-up. And in free software, if you want to have a result, you obviously need to find a person who is interested, who cares, and who is able to do it.
I found out about this discussion only in 2017, because these people didn't care to CC the maintainer of strace. Why bother him, right? So in 2017 I asked what the conclusion of this discussion was, what they had decided to do. To be honest, I didn't expect any response at all. But Andy responded. He said that he had opposed all those proposals made five years earlier, for various reasons: they were more or less hackish.
And he said, well, let's use the positive result of seccomp-BPF and just introduce a new ptrace request that will return all the information necessary to find out all the syscall details. And let's use the arch field from the seccomp data structure that describes the architecture of the system call: if it works for seccomp, it should work for ptrace. That was the suggestion. So I asked him how he proposed to implement this, because the internal kernel API lacked the most crucial part needed to implement this suggestion: there was no way to find out the architecture of another process. You could find the architecture of the current process, but not of another process. So I asked him this, and what do you think the answer was? Yes, this was the answer: total silence. And I thought, I really don't want to implement this internal kernel API myself. So there was no follow-up.
Only in 2018 was a person found who submitted an RFC patch. It was the 7th of November 2018. We used to work together at the time. So the first RFC was submitted, and there was a follow-up. Very shortly Oleg responded; Oleg is the ptrace maintainer in the kernel. Then Andy responded. I wonder why Oleg responded so promptly. Maybe the reason is that they live in the same city and could actually meet face-to-face and discuss things. I would say it's very useful to have a kernel developer living nearby. In general, it helps if you can talk face-to-face; sometimes it helps to solve questions.
So that was the very first approach. In this approach, we decided to sidestep the main problem, which is how to report the architecture when there is no kernel API for it, since nobody wanted to implement that kernel API. So she decided: let's hope Andy doesn't notice, and we'll just report the compat-ness of the process. But as you can imagine, that hope was in vain: Andy noticed.
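To make the arch idea concrete: seccomp filters receive a `struct seccomp_data` whose `arch` field holds an AUDIT_ARCH_* constant from `<linux/audit.h>`, and that constant describes the system call ABI rather than the process. The helpers below are a hypothetical sketch of how user space can interpret such a value; only the header and its constants come from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/audit.h>  /* AUDIT_ARCH_* and __AUDIT_ARCH_64BIT */

/* Hypothetical helper: does this AUDIT_ARCH_* value denote a 64-bit ABI? */
bool audit_arch_is_64bit(uint32_t arch)
{
    return (arch & __AUDIT_ARCH_64BIT) != 0;
}

/* Hypothetical helper: name the two ABIs discussed in this talk. */
const char *audit_arch_name(uint32_t arch)
{
    switch (arch) {
    case AUDIT_ARCH_X86_64:
        return "x86_64";
    case AUDIT_ARCH_I386:
        return "i386";
    default:
        return "other";
    }
}
```

With such a value attached to every system call stop, a 64-bit process invoking `int $0x80` would be reported as `AUDIT_ARCH_I386`, and the tracer no longer has to guess.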
And he insisted on implementing this the right way, that is, on implementing this arch field. So in the end we had only two options: either to drop the ball or to implement the thing we wanted to avoid. So there was an agreement between me and Elvira that I would implement the boring part and she would implement the ptrace part. In the second edition of the patch set, you can see it looks more like seccomp_data: the compat field was replaced with a proper arch field. The instruction pointer was also added, like in seccomp_data; it proved to be useful later. But as you can imagine, to implement this I had to code a lot of boring stuff: 16 patches in the first iteration to extend the API that belongs to the audit subsystem. At the same time, when we started doing this, we started finding various bugs. When you find bugs and the fixes are quite small, and the architectures are well maintained, they are promptly fixed, so some of these fixes were merged very quickly. Documentation fixes, for example, are merged very readily; they don't usually break things. But when you do something bigger, it takes a lot of time.
In the third edition there were some changes under the hood in the ptrace implementation; it was quite a lively discussion all the time. On the API level, we decided to make this information available not just when entering system calls but also in other ptrace stops. And we decided to add the stack pointer and frame pointer. The reason was that in this case the stack pointer is actually useful, and we thought: it's available, so why not? This decision later proved to be not very good. So that was the third edition. You see, it's getting slightly bigger, but you don't see the audit part of it, because there was no need to respin it. I was waiting for responses from architecture maintainers, and for some well-maintained architectures I got acks.
But you know, there are quite a few architectures in the Linux kernel, and some of them are very poorly maintained; some hadn't got any commits for half a year. Unfortunately, at this time Elvira was no longer able to take part in this project, so all the rest I had to do myself.
In the fourth edition of this API, at the request of Kees Cook, we decided to also report seccomp stops. It is very similar to the syscall entry stop, with the addition of a field describing the seccomp return data. This helps the tracer: it doesn't have to invoke an extra syscall to obtain this information.
In the fifth edition, about a month from the beginning, some fields were moved to the common part. I also had to unite the two parts into a single patch set. What was the reason for this? If you want kernel CI to be involved, it doesn't really do well when one patch set has to be applied on top of another patch set that has been submitted but not yet merged. So to get some kernel testing, I first wrote a selftest; about one third of the whole patch set is actually the selftest. By the way, I really recommend that everybody who extends an API add a selftest for it. It not only helps to test the new feature, it also helps to discover how to use it. For me it was quite simple, because it was almost the same code that strace uses to check whether the kernel implements the feature in the right way.
In the sixth edition, in the middle of December, there was a small change in the API, and it was the last one. So about a month after the first RFC patch the API was ready, and soon after that, at the end of December, I released a version of strace that supported this API, but there was no kernel with it. Well, some vendor kernels backported it, certainly the kernels we distributed in BaseALT, but others couldn't use this feature. The sixth edition was the largest patch set of all.
Why was it the largest? Because of all this architecture stuff, and because some of these patches went through architecture trees later. The seventh edition was exactly the same, just rebased onto the first release candidate of the 5.0 kernel. It's actually smaller, because some patches had gone through architecture trees. You can see it's quite big, with a lot of architectures, but about a third of it is the test.
At this moment it became clear that our idea to implement this PTRACE_GET_SYSCALL_INFO API on all architectures had failed miserably, because we found out there is an architecture called alpha. It's not that we didn't know about it. We knew that the alpha architecture exists; I even had shell access to an alpha box. But apparently alpha is a strange architecture: it doesn't implement a way to obtain the user stack pointer of another process. So in the end we decided to drop support for alpha and a few other architectures and limit the feature to those that implement tracehook. That was about 19 architectures, with a few excluded.
Then it became clear that the patch set was too big. It affected two different subsystems. One subsystem has a maintainer who uses the regular way of accepting patches: he has a tree. The other subsystem is ptrace, and the ptrace maintainer doesn't have a tree, so you have to submit patches to the maintainer of last resort, and he uses a quilt patch queue. I don't think the idea of putting the whole thing into one of the subsystems was practical. So we decided that it was too big and diverse and had to be divided. It was divided back into two parts. The first part is pure audit stuff; it had to be pushed via the audit tree, Paul Moore's tree. All the rest would have to wait until that was merged and then be pushed via Andrew Morton.
Andrew Morton is a universal maintainer who takes care of everything that doesn't have a maintainer who accepts patches. So he's a kind of maintainer of last resort.
So first it was pushed through the audit tree. We were not very lucky with timing, because when we decided to do the split, the merge window closed. You know how the Linux kernel development cycle goes: there is a release, then about two weeks of merge window, and then about six weeks of testing and bug fixing. After the merge window, unless you are a very, very prominent person, you can't submit a new API. I was not a very prominent person, so I had to wait. So we waited one release cycle. Also, Paul Moore was eager to review the audit parts, but he was not really ready to review the architecture-dependent parts, so he waited for architecture maintainers to respond somehow. So I was pinging these people, and I actually managed to collect a few more acks. Finally, when another kernel merge window had opened and closed, Paul Moore was ready to merge all this into audit/next and then into linux-next. And when the next merge window opened, it was finally merged into Linus' tree. It was May.
When the audit part was merged into linux-next, I started pinging Andrew Morton. I hoped that maybe he could merge the rest into one of his queues that gets some testing on top of linux-next. The patch set hadn't changed; maybe I got some acks at this point, and maybe one patch was merged via an arch tree or something like this. But until the audit patch set was accepted into Linus' tree, there was absolutely no reaction from Andrew Morton. Only when it was in Linus' tree, when another merge window had opened and closed, only at this moment did Andrew accept it into his -mm patch queue. And we had to wait another kernel release cycle, so it was only in July that it was finally in Linus' tree.
So in short, it was 29 commits, 47 files changed, about 700 insertions, a third of them the selftest. Two authors, and 20 more people who added their acks, reviews, or sign-offs. The whole process took almost nine months, from the 7th of November till the 7th of July. Some people manage to do more important things in their lives in nine months. But, well, it implements this new feature on those 19 architectures that enable tracehook.
The API looks this way. It's a structure. As you can see, it contains this arch field with the AUDIT_ARCH value that cost us so much time. You can obtain the type of system call stop, which is important because there is no other way for user space to find this out, and you can obtain the information common to all system call stops as well as the information specific to the particular system call stop: system call entry stop, exit stop, or seccomp stop. You can find this in the Linux kernel headers. They don't have these nice comments; probably we should add them. Or maybe not, because there is also a description in the man page, so you can find these nice comments there. Maybe it's not worth the trouble.
So, back to the first example: there is no longer any problem. It works.
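As a rough illustration of how a tracer might consume this structure, here is a sketch that interprets an already filled `struct ptrace_syscall_info`. The helper names and output format are hypothetical; the structure, its fields, and the PTRACE_SYSCALL_INFO_* constants come from `<linux/ptrace.h>` on kernels with this feature. A tracer would fill the structure with `ptrace(PTRACE_GET_SYSCALL_INFO, pid, sizeof(info), &info)` while the tracee is stopped.

```c
#include <stdio.h>
#include <linux/audit.h>   /* __AUDIT_ARCH_64BIT */
#include <linux/ptrace.h>  /* struct ptrace_syscall_info, PTRACE_SYSCALL_INFO_* */

/* Hypothetical helper: bitness of the system call ABI, taken from arch. */
int syscall_info_bitness(const struct ptrace_syscall_info *info)
{
    return (info->arch & __AUDIT_ARCH_64BIT) ? 64 : 32;
}

/*
 * Hypothetical helper: format a one-line description of the stop.
 * Returns what snprintf returns for the chosen format.
 */
int describe_syscall_stop(const struct ptrace_syscall_info *info,
                          char *buf, size_t size)
{
    int bits = syscall_info_bitness(info);

    switch (info->op) {
    case PTRACE_SYSCALL_INFO_ENTRY:
        return snprintf(buf, size, "entry: %d-bit syscall %llu",
                        bits, (unsigned long long) info->entry.nr);
    case PTRACE_SYSCALL_INFO_EXIT:
        return snprintf(buf, size, "exit: %s %lld",
                        info->exit.is_error ? "error" : "result",
                        (long long) info->exit.rval);
    case PTRACE_SYSCALL_INFO_SECCOMP:
        return snprintf(buf, size, "seccomp: %d-bit syscall %llu, ret_data %u",
                        bits, (unsigned long long) info->seccomp.nr,
                        (unsigned int) info->seccomp.ret_data);
    default:
        return snprintf(buf, size, "not a system call stop");
    }
}
```

For the fork example from the beginning of the talk, an entry stop with `arch` set to `AUDIT_ARCH_I386` and `entry.nr` set to 2 would be described as a 32-bit system call number 2, which is fork and not open.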