This talk is titled "Efficient syscall emulation for gaming on Linux". It is about a mechanism that we developed to solve a very specific problem that started to appear with modern games designed for Windows that we're trying to bring to the Linux world and run on Linux. My name is Gabriel. I work for Collabora, and this is part of a much bigger effort to improve the gaming ecosystem on Linux. Since this talk is going to go a little bit deep into system calls and how they work, I'm going to start with a very quick introduction to what a system call is and how it usually works when invoked by an application. Basically, a system call is a mechanism for the kernel to provide a specific service to an application. It's the main interface that an application has to reach the kernel and ask for something, for a specific feature. The kernel and a process usually execute in different modes of the processor. The kernel has access to the entire machine, including the devices and peripherals connected to it, while the application operates in a much more restricted space, where it thinks it's operating alone on the machine: it has a view of only its own memory, and it cannot see the other processes. So the system call interface is the barrier through which the application can reach the kernel and ask for something, for instance to read a file or write to the standard output. One interesting detail about system calls is that they are rarely invoked by the application itself; that usually goes through libraries. A programmer is not going to call write and execute the syscall instruction directly. They will usually call into libc through a higher-level function like printf, and that function will eventually call write.
System calls, the system call interface, are operating-system-specific, in the sense that each operating system provides the set of system calls that makes sense to implement that operating system's ecosystem, its programming interface, its API. But they are also architecture-specific, in the sense of how you invoke a system call and how you pass arguments to it. Regarding the operating-system specificity: if we look at Linux, Linux implements system calls that are useful for a POSIX API, for a POSIX programming environment. For instance, it implements creat, open, read, write; those are system calls that are useful at a higher level to implement the POSIX API that any programmer in the Linux environment is used to. But it also exposes system calls for specific Linux features that are not POSIX. For instance, it has a bpf system call, a seccomp system call, kexec_load, and several other system calls. Windows, exactly like Linux, supports its own programming environment; it supports the WinAPI primarily. So Windows implements the set of system calls needed to support that library, that interface. Likewise, macOS implements the POSIX API and also the syscalls that are useful for the programming environment that a Mac developer expects. But system calls are also architecture-specific in the sense that they are invoked through architecture support, through a specific ISA instruction. For instance, on x86-64 there is an assembly instruction called syscall that makes the processor enter the special mode where the kernel operates. When invoking a system call, the way you pass parameters to the kernel is different from how you invoke a function. There is no specific GCC support to call a system call; you need to actually write assembly code. And that is the reason why applications are not expected to call system calls directly.
They usually go through a library. One important detail is that other libraries are also very unlikely to make a system call directly. They usually go through libc or some system library, because there is not much sense in re-implementing the syscall wrappers. That doesn't mean that an application cannot invoke a system call directly. It can: you can just hard-code your assembly function that wraps the arguments and invokes it. This is not usual, but it's completely possible. And as we're going to see, applications on Windows are starting to do that, and that's becoming a problem for us. Just a word on libc support. When I talk about calling a syscall directly, I don't mean calling the libc function syscall(2), which is a generic wrapper to invoke any kind of syscall. An application that calls the function syscall is actually calling into glibc, which will, in turn, execute the assembly wrapper that makes the system call into the kernel. So actually executing the system call, which happens inside libc in this case, is very different from calling the function syscall. And the same thing happens on Windows. Windows has its libraries that implement the WinAPI, and every time a Windows application needs a certain function, it calls into the WinAPI, which is responsible for calling into the kernel to execute the syscall. What we observed until now was that Windows was a bit more strict than Linux: it was much, much rarer to find an application that would make a system call directly. Instead, they would always go through the WinAPI. And there are several reasons for that, in particular because this is part of Microsoft's documentation: they tell you not to call system calls directly, and the syscalls themselves are less documented than the WinAPI.
So when we run games that we bring from the Windows world, we run them through Wine and through Proton, which is a collection of programs that includes the Wine environment but adds a lot more features to provide an optimal emulation environment for Windows games running over Linux. And everyone says that Wine Is Not an Emulator. But why isn't it? Well, you need to think of Wine as a compatibility layer. What Wine does is re-implement parts of the WinAPI, run the Windows code, and serve it through that API. So when a Windows application running over Wine needs to call a WinAPI function, it calls into Wine, which re-implements that function in terms of Linux operations. When the Windows application calls CreateThread, it actually calls into Wine through a library call as usual, and Wine re-implements that with fork, with clone, or whatever. So Wine provides that abstraction layer that converts Windows operations into Linux ones. One important detail about Wine is that it has no virtualization at all. It doesn't do KVM, it doesn't do any kind of virtual machine, and it's not a sandbox in any sense. Wine and the Windows application run in the same process space; they are basically a single process. Obviously, there is the Wine server, which runs as a different process, but the parts of Wine we're discussing here run in the same process, as a library. But what happens now is that we have modern games that invoke the architecture-specific syscall instruction directly in the Windows code. So instead of calling the WinAPI, which Wine is emulating, to execute some system call, it goes directly to the kernel. And Wine has no way to notice that syscall happened before it reaches the kernel. So in this sense, we have this application that thinks it's running on Windows.
It thinks it's calling the Windows kernel, but when it executes the syscall instruction, it reaches the Linux kernel with a very different ABI. Remember, the syscall ABI is operating-system-specific. When we reach the kernel, the kernel is going to look at arguments that might not even make sense for a Linux kernel, and then we have undefined behavior. Either Linux misinterprets those arguments and executes something completely unexpected, or, most likely, we return EINVAL or some other error condition to the application, and at that moment the game will likely crash. So we need a way to solve that. And we cannot simply say, okay, let's recompile the games, because games are usually proprietary applications and we have no support, or very little support, from the game studios themselves. It's not a matter of just rewriting the game for Linux: that involves high costs, and usually studios are not very interested, because it's basically a chicken-and-egg problem; we are trying to improve the Linux platform to attract more attention from studios. So recompiling is not an option. But why are games executing system calls directly, which seems to go against what Windows has always done? Well, the two main reasons that I found are: one, DRM. The games want to control the entire execution stack up to the kernel so they can verify that the game is not being modified at any point. And two, anti-cheat mechanisms. They want to be able to observe what is happening in the game at all times and control the entire execution flow to prevent cheating. So we have described the problem, and now let's see what we're trying to do to solve it. Just a quick warning: we are going to be looking at a bit of assembly, but it is going to be as gentle as possible. The first attempt that we made was: let's patch the game at execution time, at startup time, and replace all the syscall instructions with a call into Wine.
So we had this low-level Wine wrapper that can be invoked. We patched the game code to replace the syscall instruction itself with a call to the syscall handler that executes in Wine, and then we are able to capture the execution before it even reaches the kernel. That's a very smart idea, and it's not novel; it has been done before in several cases. For instance, there is a Linux library called libsyscall_intercept that does exactly that. And even for games we already do that kind of patching, for the case-insensitive file system fallback, for instance. Unfortunately, in this specific case, we cannot do it. The reason is that any attempt to modify a code memory page in the game is going to trip the anti-cheat code. There are threads in the game that keep verifying the memory pages to check for corruption, and if they find corruption, they might consider that someone is tampering with the game, and the game will crash. So the memory space of the Windows application is off limits; we cannot modify it in any case. The only alternative is to go into the kernel: we need to let this raw Windows syscall reach the Linux kernel, and then we need to find a way to send it back to Wine for emulation. This is also not novel; there are a lot of applications that do that. For instance, GDB has a mechanism to intercept system calls when they reach the kernel, so you can stop the program and debug it. User Mode Linux uses a similar interface for emulating system calls in user space, and there are also container technologies that do that. GDB will use ptrace; User Mode Linux will also use ptrace; and there is also seccomp to do that. But when talking about games, performance is a very important thing. We don't want to ruin the game's performance by doing a lot of unnecessary work just to fix a small thing that is being forced onto us by anti-cheat software. So we need to consider one thing.
We have this application that has multiple personalities. Part of the application knows it's running on Linux: that part is Wine. It knows this is a Linux kernel, and it is able to execute native Linux system calls. Those are calls that need to execute super quickly, because they are going to happen all the time. But those specific system calls that come from the Windows code, from the Windows side, we need to be able to quickly intercept and emulate. So the first problem is that we need to differentiate these two types of system calls coming into the kernel. And we cannot simply look at the ABI of a system call to decide whether it makes sense, because we cannot accept something that happens to make sense by accident, even though it came from Windows, and let it execute. We need to be sure that whatever is coming is either coming from the Linux side, the Wine side, in which case we can let it execute, or coming from the Windows side, in which case it needs to be stopped and emulated. The first approach that we took was an attempt to do all this filtering back in user space. What we did was basically install a small firewall, using seccomp, at the entry of syscall handling in the Linux kernel. Any syscall that reaches the Linux kernel goes through that small firewall, which verifies: oh, this system call is not coming from the well-known allowed address that is part of the Wine syscall dispatcher, so we forward it back to Wine and say: Wine, we have this system call here, you need to figure out what to do. And then Wine will filter that syscall: if it came, for instance, from a memory page that is known to be Wine's, it dispatches it to the kernel; otherwise, it tries to emulate it. That means that any system call coming from libc, from Wine, or from the Windows application does a round trip to the kernel and back to Wine.
And that is a problem, because it implies multiple switches between user space and the kernel, which triggers a very big performance hit, in particular in a post-Meltdown world. This triggers a performance hit for every system call that the game makes, which is unacceptable for us. So an approach where we filter all the system calls in user space could never work. Next, we tried a seccomp-based filter, which, well, is basically what seccomp is for: seccomp is a mechanism to filter system calls, allowing or blocking them. So we have this small cBPF filter. We cannot trust the ABI at all, so what we do is look only at the origin of the system call, the address of the instruction that executed it. This way we can allow the libc and Wine system calls to go into the syscall handlers without bouncing back to user space, but we need to do some filtering in the kernel. That also has some problems. The main problem is that this is a cBPF filter, not eBPF; it is a much more limited version of the BPF language, which means we require a lot of instructions to do this verification. Also, this is a static filter, which cannot enumerate memory regions by itself, so we need a compiler inside Wine that generates a large filter covering each memory area that might belong to Wine or to Windows. As a result, our filters get enormous. For a single memory region we need four or five cBPF instructions, and the filter grows a lot. And a big filter implies slow execution for every system call: each one is going to have to go through that filter, and that is going to take a while. In addition, memory regions come and go as the program makes allocations and loads libraries, on the Windows side and on the Wine side, and for both of them we would need to update the filters. But seccomp is designed as a security mechanism, and there is no current provision to remove filters that have been added.
So we would need to modify seccomp a lot. This led to the first proposal that I brought to the mailing list. The idea was: let's write a new mode for seccomp that filters syscalls based on a protection bit on the memory page, which says whether that page can execute system calls or not. The beauty of that idea is that even though Wine and the Windows application run in the same process, they are always on separate pages. Wine is loaded on one subset of memory pages; Windows code is always going to be on different pages. And Wine knows whenever the Windows application loads a new page, because Wine implements the WinAPI functions that load new pages. So the idea was: let Wine mark those Windows pages with a no-syscall bit, saying that syscalls from those pages need to be blocked by seccomp. Then seccomp can do whatever it wants with them: redirect to the Wine syscall emulator in user space, block them, or whatever. The filter no longer needs to check every memory region to see where an address came from; it only needs to look up the VMA of that address, which is much cheaper in the kernel, check that bit, and either return to user space, if the syscall came from Windows, or let it proceed. That gives us very good performance both for native and for emulated system calls. But there is a problem with that approach. We still need to be able to set and unset the no-syscall bit, but the main problem is that seccomp is a security mechanism. And when we're talking about some pages being able to execute system calls and some pages not, within the same process, this is not safe at all, because there is no real architectural isolation between a Wine page and a Windows page.
So think of a rogue application that is supposed to be contained by this mechanism: this rogue Windows application can just jump into Wine, into an unprotected page, execute the system call, and come back. This would mean that seccomp has a mode that is not suitable for security, and that is not a good idea for a mechanism that was designed primarily for security. In fact, Wine doesn't really care about that, because, as I said, Wine is not a sandbox; Wine blindly trusts the application it's executing. So Wine doesn't care about security, but seccomp does, and this mechanism doesn't fit seccomp. Then the only way forward is: we still need a filter, the filter needs to be in kernel space, but it cannot be seccomp. We need a new mechanism, and this mechanism is what I'm presenting here today. I'm calling it Syscall User Dispatch. It implements a sort of selective firewall for system calls that can be controlled entirely through a control variable in user space. It also exposes a fast path, where you can declare an allowed region of memory that can always issue system calls directly, without going through the firewall. If you look at the picture on the right, you can see that the firewall has been squeezed to open a fast path for libc. And that doesn't necessarily have to be libc; it can be any library that always executes system calls in your system. Basically, Wine, or any application using this functionality, specifies which range of memory addresses is able to execute system calls. It has its similarities with syscall-origin restrictions in the BSDs and mechanisms like that. And the boundary that we really care about is the first boundary in that picture, between the Windows application and Wine: above that boundary, no syscall can be executed by Linux directly, and below it, any system call can be trusted to be Linux-native and can be executed directly.
So we need a way to quickly cross that boundary and let the kernel know whether the firewall should be up or down. This ensures that a Windows application cannot send down a system call that the kernel won't capture, while still giving us good performance for the syscalls that can be executed natively. That is the interface we really care about. And the way we inform the kernel is through the simplest mechanism ever: a control variable that is shared between the kernel and user space. That variable is configured per thread. If it is set, the kernel will reject any system call coming from that thread unless it came from the fast path, the single allowed region of memory. Then, when crossing that boundary, Wine can just turn the variable off and on depending on which side of the barrier it's going to: if it's going into Wine, it disables the firewall; if it's going back to Windows, it enables it. This gives Wine a very quick way to configure Syscall User Dispatch; it doesn't even need to go into the kernel to enable or disable the firewall. And the kernel-side cost of performing this filtering is also very cheap: the only thing we need to do is a get_user() to read that variable. So, as I'm going to show you later, the performance is very close to optimal for gaming. A few comments on the design. You will have noticed that this is a very specialized design for a very specific problem. That is a trade-off we often see, between specialized designs with good performance and generic designs with less-than-good performance. But in cases such as games, where we are trying to squeeze out the best possible performance and the performance increase translates directly into frames per second, this is a justifiable approach.
In fact, we could observe that the overhead of executing a fast-path system call, which is basically a system call coming from libc, is less than 15 nanoseconds on x86, which is basically the cost of jumping into the check, verifying that the variable is off, and going back. So this is a very interesting mechanism that completely solves the problem and allows games to run unmodified. What improvements could we still make? Well, first of all, we could improve the way we redirect system calls to user space. The way we do that right now is through a signal, a SIGSYS, that goes up to the application. And on x86 in particular, signals are quite costly: delivering one takes far longer than the fast path. So the question is, can we do better? The answer is yes. I experimented with a mechanism suggested by one of the x86 developers, where we do a very raw return into a scratch area in Wine. It works basically like this: you receive all the arguments of the system call in the syscall ABI, and as soon as the kernel detects it is not going to execute that system call, it just jumps back to user space, changing the protection mode of course, but keeping the same ABI. And we give the Wine syscall emulator the responsibility of basically reimplementing the syscall handling: it needs to interpret that ABI and then reach C code. This increases the complexity of the Wine emulator, but improves performance a lot, since we don't need to go through everything the kernel does to deliver a signal. The question is whether that is worth it, because in the games we observe nowadays, a system call that goes directly to the kernel is quite rare. In fact, it happens so rarely that needing to emulate a system call doesn't really affect performance.
In addition, the cost of emulating a system call is so high that the cost of delivering the signal doesn't seem to matter. So it's all right for now; we don't need to change the mechanism that delivers the system call back to user space. It's fine to do it through a signal, though that may change in the future. The most important thing for us is to be able to filter the system calls that come in natively and dispatch them quickly, while still capturing rogue system calls and sending them back to user space. Now, when I presented this work before, the main concern was about security. Given that I'm allowing any application to emulate its own system calls, is that a security concern? The quick answer is: maybe. We addressed it to the best of our knowledge, which means that we prevent any application from emulating the system calls of a different application or its children. The attack surface is pretty much eliminated in the sense that you cannot have one process emulating the system calls of others. This is fine for Wine, because the same process is emulating the system calls that come from Windows, but it limits the usability of this mechanism for other applications, like tracers and debuggers. That is a security trade-off that was important for us. About the status of this work: I submitted, I believe, version 7 before recording this talk. It has received very positive feedback upstream and is very likely to be merged soon. There is only a small issue, unrelated to the mechanism itself, about how limited the resources are in the syscall handler on x86, on 32-bit in particular, that I'm trying to figure out to allow this mechanism to enter the mainline kernel. It's unlikely to make it into 5.10, which, at the point of my recording, is going to be released in five days; we are at the final release candidate now. But this is likely to make it into Linux 5.11. And that's what I had to present.
I am required to mention that Collabora is hiring. So if you are interested in working on challenging issues improving the gaming experience on Linux, or on a lot of other interesting open source projects, reach out to us. And I guess I'll be taking questions while this presentation is playing. So thank you very much. Bye.