Hello, let's talk about how we are trying to make Linux better for Windows games. There has been a lot of effort going on in different areas of the kernel, and I've been working on one of them. I'm going to tell you, in the form of a story, how we found a problem, how we are trying to fix it, and how it is still somewhat in progress. I'm Usama, I work for Collabora, and I specialize in core kernel work. I'm really enjoying the city here, I've been visiting around, so it has been fun.

There are some Windows APIs which are essential when it comes to games. One of them is GetWriteWatch. This API returns which addresses in a region have been written to. Games use it to find out which pages have become dirty, and that is useful for different kinds of things, like implementing copy-on-write mechanisms. For example, a game may have a pool of 100 pages and want to find out which pages have been updated, so that it can copy only those to the hard drive. This API will tell us which pages have become dirty, and only those pages get copied. It also has some specific semantics to be careful about: one argument lets the caller ask for only a limited number of pages. For example, there may be 10 dirty pages, but the caller can ask for only three, so that it only deals with those. We need to handle these arguments very carefully when we look at how to support this inside Linux. The API is also widely used in security, intrusion detection, and debugger detection. It makes sense that we see this kind of intrusion detection in games, because a lot of games are online and competitive, so they need a mechanism to find out whether some external program is writing to the game's memory.
So this API is quite important, and some garbage collectors use it as well. There is another API, ResetWriteWatch, which is a sub-part of the previous one: it just resets the tracking state for specific pages.

So what do we really want? We want these APIs to be translated by Wine or some other translation layer, and then we want the Linux kernel to handle the tracking. There can be different mechanisms; user space can handle it without even asking the kernel to track these things. The first thing that comes to mind is using mprotect and SIGSEGV, and that is the mechanism Wine is using right now. What happens is that we write-protect the memory and register a segmentation fault handler. Since the memory doesn't have write permission, when a program tries to write to it we get a segmentation fault, and in that handler we do the bookkeeping of which pages have been written to and which haven't. That bookkeeping lets us translate the Windows APIs on Linux purely in user space. But it is not a very fast or good mechanism: signals are POSIX standard, but they are not fast, and as you can see, there is communication going on between the kernel and user space. And even with this, some things don't work. For example, there are drivers which don't like write-protected memory. Developers can do anything, right? You are restricting how the memory works, and that can fail at some point in some application.

There can be another mechanism. There is a relatively recent system call in Linux called userfaultfd. You initialize a file descriptor and then register your memory with it. The effect is similar to mprotect, but it uses a different mechanism: we are not write-protecting the memory through page permissions.
It is just another feature: the kernel write-protects the memory internally, and whenever a program tries to write to it, we get a message, by polling the file descriptor, saying this memory is being written to, what do you want to do? Then we do the same simple thing as in the previous mechanism: we resolve the fault and do the bookkeeping that this page has been written to. This is even slower than mprotect, because whenever someone writes to that memory, the kernel sends a message to user space through the file descriptor, and user space has to handle it. So again there is communication between kernel and user space, and it is not very efficient. There have been cases reported where this creates problems for game loading: a game that should load in 10 or 20 seconds takes more than one or two minutes.

Before talking about memory management further, let's go over some basic concepts very quickly. The kernel deals with memory in page-size chunks. The page size can differ between platforms; on x86 we use 4 KB pages. Page tables are used to map virtual to physical addresses. These tables have entries, and the entries contain the page frame number, which identifies the physical page in memory; protection bits, like whether the page has read or write permission (for example, write-protecting memory with the mprotect syscall changes these protection bits); and some status flags, which are what we are going to talk about next. Then there is the virtual memory area, commonly called a VMA. It is used to track a process's memory areas. For example, a process has a BSS segment, a data segment, and other segments whose pages share the same kinds of attributes.
So a VMA is an internal kernel structure to keep track of memory with the same attributes. A page fault is an exception mechanism: whenever the hardware finds that a page is not present in memory, it generates a page fault, which the kernel handles internally by loading the page. But a fault is also generated when someone writes to a page that doesn't have write permission. Last, there is the TLB, which is a kind of cache: it keeps the most recent virtual-to-physical translations so that translation becomes faster. The processor asks the TLB first; if it has the mapping, it uses that, and if not, it walks the page tables. All of this will come together in the next slides.

The kernel already has a soft-dirty PTE flag. It is a software-only status flag; there is also a dirty flag maintained by the hardware, but we are not talking about that one. This soft-dirty flag is quite similar to what we are looking for. How does soft-dirty work? The writable bit of the page table entry is cleared. When someone tries to write to that memory, a page fault is generated and handled inside the kernel, which finds out that it happened because of soft-dirty tracking, sets the flag, and resolves the fault very early in the handler. So it is quite fast, not slow like the other mechanisms we talked about. But we only have two operations available right now: we can read the flag for all the present pages of a process through the pagemap file, and we can clear the soft-dirty flag for the entire process. There is no way to clear the flag for a particular memory range only, which is definitely a shortcoming. Also, when this feature was being written, they found that memory can get deallocated and reallocated again, and the soft-dirty PTE flag alone cannot track that.
So they added another flag at the VMA level, VM_SOFTDIRTY, to track which VMAs are dirty. Now soft-dirty is not only a PTE flag: it is an OR between the PTE flag and the VMA flag, and if either one is set, we consider the page soft-dirty. We will see how this VM_SOFTDIRTY is going to create a problem for us.

When this was added, there was another consideration: the kernel wants to keep as few VMAs as possible, so whenever there are adjacent VMAs with the same flags, the kernel tries to merge them. When soft-dirty was added as a VMA flag, it created a problem: some VMAs would have the flag and some would not, so they would not get merged, and the number of VMAs would increase a lot. We don't want that. There is a maximum number specified in the kernel; you can raise it with a sysctl, but relying on that would be a regression. So they looked at the problem and solved it by just ignoring the soft-dirty flag when deciding whether VMAs should be merged. This was back in 2014 or so. But that created its own problem: if one VMA is dirty and another is not, and both get merged, the whole merged VMA becomes dirty. If you were expecting only five pages to be reported dirty, you may get ten. It has become a not very accurate mechanism.

Overall, the shortcomings are: it is not accurate; the soft-dirty flag cannot be cleared on just a part of memory; and an atomic get-and-clear operation is not possible, which is exactly what we need to translate GetWriteWatch. Then there is CRIU, the Checkpoint/Restore In Userspace project. It has been using soft-dirty for quite a while, and they probably introduced it into memory management in the first place. They are also not very happy with it, because they sometimes need to freeze the process to take a snapshot, find out which pages are dirty, and copy them.
CRIU takes a snapshot of a process so that you can run it again later or transfer it to another machine; it is also used for container migration. Sometimes there are post-migration crashes, so we found that the CRIU project also needs some other, better mechanism to deal with these kinds of problems.

So we decided: let's add an ioctl based on the soft-dirty flag. We would add a clear operation on a particular range of memory, which is not possible today; we would add an atomic get-and-clear operation; and because we want accuracy, we would ignore the VM_SOFTDIRTY flag, controlled by an argument from the user. If the user wants accuracy, we just ignore it. Things were going well; the patches were carried downstream, and we were sending probably the fourth or fifth revision upstream, when we found out on our latest release that the behavior had broken. When we started working on this, the soft-dirty PTE flag was being set regardless of whether the VMA was dirty or not, but this was no longer the case in the latest release. We tried to argue about it, but nobody listened, because it had not broken user space: user space was behaving the same, so the rule was being followed.

So we thought: let's fix the soft-dirty functionality once and for all. There was a bug present from the feature's creation: whenever a new VMA was created, the kernel first checked whether it could be merged with the previous VMA, and only afterwards set the soft-dirty flag. That was the original bug, which they didn't find at the time, and it allowed VMAs to merge without considering soft-dirty. Now that I had found the issue, I tried to get a fix merged, but they said it would definitely cause regressions, because this is a very old feature and we never want to break users.
So I thought, why don't we keep track of which part of a VMA is dirty and which part is not? I spent some time and came up with a patch. We actually tracked which part of the VMA was not dirty, because that was easier to implement given how the code looked, and it worked fine. But it would have increased the VMA structure size by 8 bytes, and each sub-region entry of a VMA would have added another 32 bytes. That would have been a lot. VMAs are used everywhere inside the kernel, and kernel developers are always looking at how to decrease the structure's size, not increase it, and this is not exactly the most widely used functionality. So they told us: not possible.

So we thought, let's leave the soft-dirty flag alone and instead do something with userfaultfd. We can add an asynchronous write-protect feature and build the ioctl on top of it. We already talked about userfaultfd write-protect: in that mode, faults need to be handled from user space, with messages going between kernel and user space. This would be a new write-protect mode where the kernel resolves the fault by itself. That was not the original intention of userfaultfd's designers, but it is what we were looking for, and there was no other way, so we added it. A page is considered dirty, that is, written to, if it is no longer write-protected. That was easy to do. But later we found out that if a PTE is empty, what is called pte_none, it cannot remember whether the state was cleared or not. So we had to add another feature to userfaultfd, WP_UNPOPULATED, which places PTE markers instead of setting a flag, because an empty PTE has no place to put a flag. It worked quite well. Then we updated the PAGEMAP_SCAN ioctl to use this asynchronous write-protect, taking its input in a struct; I will show the struct later.
Then we had to handle scanning for all memory types, because we want feature-complete patches. So we included hugetlb as well, which had not been covered, because Wine doesn't use it and we were not sure whether CRIU needed it. We also needed to handle holes. So now we handle four kinds of memory: normal pages, transparent huge pages, hugetlb pages, and finally holes.

The supported operations: we need to scan the address space, so there is always a get operation. Instead of adding a flag for it, we decided that whenever the user provides an output buffer, it means he wants data back from the kernel, so the get operation is performed. Then the PM_SCAN_WP_MATCHING flag is used to write-protect the matching pages. As I said, when we write-protect memory, it means it is not dirty anymore; "dirty" and "written to" are synonymous here. Also, sometimes you may have a larger area of memory where some part is not initialized or registered with userfaultfd, and you would get an error in that case. Sometimes the user may want to catch that, so we added another flag, PM_SCAN_CHECK_WPASYNC: if it is set, we abort the operation when we find a region that was not registered.

Then we got some more feedback from the CRIU developers. They said: you are returning one byte per page, and that's a lot of data, so why don't you compact the returned data from the kernel? So we decided to return address ranges with their flags instead of individual addresses with flags. That proved to be a really good decision; later on we measured several performance improvements because of it. They also asked us to support some more flags in the ioctl. This is a generic scanning ioctl, so they asked us to add four or five more flags to it.
From our side we are only interested in the soft-dirty-style "written" information, but they needed support for more, so we just added it. They also wanted filtering support. For example, they can pass specific flags saying they want all the present pages which are file-backed, and the kernel returns exactly those pages, instead of them wasting time filtering in user space. That is more efficient.

So we started returning compacted data: we return the data in the form of ranges, and the flags are returned along with them. This also means less data is copied from kernel to user, because a specific copying mechanism is used to copy data between them. There are some limitations too: because we are working in memory management, which controls all the memory, we cannot write to user memory while holding the memory management lock. So we have to make a copy of the data inside the kernel first.

These are the flags we are supporting right now. PAGE_IS_WPALLOWED tells us whether asynchronous write protection is enabled on the region. The flags are returned per page, and when we return data from kernel to user, we treat every page as a normal page. For example, a transparent huge page, which is normally 2 MB, is considered to consist of 512 normal pages, and we return address-range data covering those 512 pages. The PAGE_IS_WRITTEN flag is our soft-dirty alternative; you could call it the "written-to" flag. Then we can also find out whether a page is file-backed, present, swapped, and so on. These flags can already be found via the pagemap file, right? But this ioctl is better in the sense that we are adding operations which were not present before, plus a filtering mechanism. That helps in cases like applications with very sparse VMAs. For example, there are sanitizer-style tools like ASan, used to find memory errors and leaks, which create shadow memory for all allocated memory.
The number of VMAs increases a lot in that case, and such tools were scanning the entire address space from user space by reading the pagemap file, because there was no other mechanism. Now they can just call this ioctl and get the exact addresses of those pages.

So we added some masks. category_inverted flips the meaning of a flag: if we don't want a particular flag, we set its bit there. category_mask holds the flags we are looking for. There are some other masks too: if we are looking for a page which is, say, present or swapped, we can use category_anyof_mask. And then there is return_mask as well.

We are also taking care of backward compatibility for the future: we keep the size of the struct inside the argument, because in the future someone may want to add another field to this ioctl's input. This is just for future-proofing. We also added max_pages, which can optionally be specified to find only the first N pages of interest, just like I mentioned we need for translating GetWriteWatch. And there is another value returned from the kernel, walk_end: how much of the address space was walked. For example, say there are 100 pages of memory and we want to find only 10 pages. If we found those 10 by the time the scan reached page 50, we return to the user, and the address of the 50th or 51st page, wherever the walk stopped, is stored in walk_end.

An ioctl can only take one argument, so we pass a structure. These are all the fields: size, which as I said is for future-proofing the feature; flags, the operations we want to perform; start and end, the addresses we want to scan; walk_end, returned from the kernel; vec and vec_len, the array and array size for the ranges we want back from the kernel; and max_pages, which is optional, to get only some pages out of all the memory.
And then there are the masks. We were definitely expecting a big performance boost here when comparing with the user-space approach, because there is not much communication going on between kernel and user space. The initial versions had some bad performance, because we were reusing a lot of userfaultfd's internal infrastructure, and TLBs were getting flushed every time we write-protected a page. Later on we had to write our own code for it, because we wanted a very simple path: we just want to set one userfaultfd write-protect flag. That gave us a big performance boost. We also had to do multiple passes of tuning. For example, we need to produce the output data in the form of ranges, so we iterate over the address range the user gave us; and because we cannot write to the user buffer while holding memory management locks, we had to see how that would work best. We use a 12 KB temporary buffer inside the kernel to collect the data during scanning, then copy it to the user, and the scanning goes on.

The numbers look almost too good, because we are comparing with user space, which is not an entirely fair comparison: we are enabling things here rather than just making something faster. We are looking at three things. First, write-protecting and getting the written-pages data back: what is the speed of that? That is the blue line in the graph. Then we measure the time to emulate GetWriteWatch and ResetWriteWatch. These are not exact benchmarks, because the real implementation of GetWriteWatch and ResetWriteWatch would come from the Wine developers.
There would be a lot of other things they have to handle; these are simple use cases. We compare against mprotect plus SIGSEGV, and overall it seems like a really good improvement, while we are also adding more features.

Right now we are at version 33. Some patches have already been reviewed, and four or five developers are involved. But at version 33 we still haven't gotten any reply in the last three or four weeks, maybe because of the merge window as well, so let's see when we get some news. It can be merged at any time, or not at all; we are not sure until we get a reply. There are some games, as examples, which really use these write watches, so they can benefit once this gets merged into the kernel and Wine starts supporting it. Wine will then switch from the mprotect mechanism to this PAGEMAP_SCAN ioctl, which would be a much cleaner mechanism.

Let's talk about the problems faced in reaching this point. Core kernel, and especially memory management, is somewhat complex territory whenever you try to do something there, and you need a lot of testing to make sure everything works for every memory type. Even after testing everything, there may still be some race condition you are not handling. So testing is the most important thing: even though I was writing selftests alongside development, there were still things found later, and I had to write randomized, more intelligent tests to catch missing corner cases. We also expected the original soft-dirty developers to get involved so we would converge sooner, but they were not available or had moved elsewhere. Soft-dirty was originally added around 2013, so they would definitely be at different places these days.
So it was really difficult to find people to review things or give us suggestions, and there was a lack of timely reviews and feedback on the mailing list. Maybe these patches are just not very interesting for other kernel developers to review and comment on, and that has slowed down development overall. You can definitely contribute by finding these patches on the mailing list and testing or reviewing them; it helps a lot. We have also added documentation in the patch series so that it is easier for other developers to understand. Almost 100 test cases have been added as selftests, and thorough testing was also done by a Wine developer. The man-page updates are still pending: I'm waiting for the patches to at least get accepted into mm-unstable, and then I'll write the man pages. It takes quite some effort when someone asks you to update the patches, because you need to update the documentation and tests as well, and then I would have to update the man pages too, so I have skipped those for now.

So that's all. Any questions?

Okay, yeah. To run Windows applications on Linux, Wine is used as a translation layer. I described how the translation of these Windows APIs, GetWriteWatch and ResetWriteWatch, is currently not very efficient and also has problems in some particular cases. So we are trying to translate them in a better way inside the Linux ecosystem. Once this goes into the kernel, Wine can translate those API calls in a better way. That would help Wine and Windows applications, and it also has the potential to help other things; CRIU is just one example.

Yeah, Wine was the initial motivation, but then we found CRIU, whose developers are very interested in this feature. Once something goes in, other users come along. These are just the two examples I had.
It is all about the culture at Collabora: the senior developers, very good management, and the client you are working with. They are very cooperative, and they know how upstream works. There are also subsystems, or particular components within subsystems, that people are not much interested in, but you still need to push because you are interested in them. Sometimes when there is silence on the mailing list, we just go work on something else for a while, but it keeps going. And there are a lot of suggestions upstream. Different developers have been involved at different phases: initially, when we were sending versions one through five, some developers were involved and giving us suggestions; later, other developers who were not interested earlier came along, then left, and newer developers came. Whenever someone gets involved, they give a lot of suggestions, and you need to deal with them, see how things work, and sometimes you just have to do whatever they are saying, because there are not many reviewers. When there are a lot of reviewers, you have the choice to discuss: if one agrees and another doesn't, you have options. But if only one developer is reviewing, you will have to do whatever that reviewer says. So it is all about patience. Well, you can catch me later. Thank you so much.