So let's get started. For those of you who don't know me or have forgotten my name, I'm David. I'm with Red Hat, I do memory management stuff and, usually, memory management stuff that's fairly close to virtual machines. Today I want to talk about something that's not directly related to virtual machines but still important, and it has the magical name of GUP and COW: get_user_pages paired with copy-on-write, but specific to anonymous memory only. I have a lot of stuff to cover and I'm most probably not going to get through everything. I'm going to talk about the most important stuff, and if we have more time we can dive into the details, but we're going to focus on some background and history and then on the approach that is currently being upstreamed.
So let's talk about some background and history. Whenever we talk about copy-on-write with anonymous memory, we actually want to share anonymous memory between processes, during fork for example; another example would be KSM. When we share an anonymous page between two processes, we have to map it read-only into both processes, and as soon as one of the processes tries to write to that now read-only mapped page, we have to decide what to do. We may have to copy the page, or we might be able to decide: there's no need to copy, I can just reuse this page, because nobody else is sharing it with me anymore. The simplest condition you can think of for whether you have to copy or can reuse is to look at the map count: if the map count is bigger than one, there is definitely somebody else sharing the page, so you have to copy and cannot reuse. In the example at the bottom of the slide, two page tables map the same anonymous page, and as soon as somebody writes to it, the page gets copied and the copy can then, of course, be mapped read-write.
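As a rough illustration of the map count rule just described, here is a toy Python model. It is not kernel code: the `AnonPage` class and the `write_fault_mapcount` function are names invented for this sketch, and a "page" is reduced to a string plus a counter.

```python
from dataclasses import dataclass


@dataclass
class AnonPage:
    """Toy stand-in for an anonymous page (hypothetical, not a kernel structure)."""
    data: str
    mapcount: int = 1  # number of page tables that map this page


def write_fault_mapcount(page: AnonPage, new_data: str) -> AnonPage:
    """Write to a read-only mapped page, deciding by map count only."""
    if page.mapcount > 1:
        # Somebody else still maps the page: copy it and move our mapping
        # to the copy, leaving the original untouched.
        page.mapcount -= 1
        return AnonPage(data=new_data, mapcount=1)
    # We are the only mapper left: reuse the page in place.
    page.data = new_data
    return page


# After fork(), parent and child map the same page read-only.
shared = AnonPage(data="old", mapcount=2)

# The parent writes first and gets a private copy.
parent_page = write_fault_mapcount(shared, "parent wrote")

# The child writes afterwards; it is now the only mapper and reuses the page.
child_page = write_fault_mapcount(shared, "child wrote")
```

In this model the first writer pays for the copy and the last remaining mapper reuses the original page, which is the behavior the map count condition is meant to give.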
That all sounds fairly easy, I would say, if we didn't have get_user_pages. I assume most of you are familiar with get_user_pages. It's a mechanism to get a reference on a page, and usually we then want to use that reference to access the data stored in the page, either directly or eventually later. It comes in multiple flavors. One flavor is FOLL_GET, which supposedly means that we only want to access the struct page and not the page content. That isn't true: it's documented that way in the kernel, but, for example, O_DIRECT uses FOLL_GET extensively. Then we have FOLL_PIN, which essentially means: I want to take a reference on the page and I really want to access the page content. Usually we use that for references taken on behalf of user space, for example when we deal with VFIO, RDMA, io_uring fixed buffers and things like that.
We can have read-write references and read-only references, and in general GUP references are counted via the page count and not via the map count, obviously, because the map count only tells us how many processes are actually mapping this page. So when we only look at the map count and the page count, we don't know much about GUP references. Say the map count is one but the reference count is bigger than one: what is that other reference? Where is it from? Is it a read-write reference or a read-only reference? Is it a GUP reference, or just some other reference from the operating system? We don't really know. So the question is whether the two things paired can be problematic, and I wouldn't be here if they weren't. They are, and what kicked it all off was the copy-on-write security issue.
I'm going to walk you through that briefly. There is something called vmsplice, which allows you to take a read-only reference on a page and attach it to a pipe. Later on, when somebody reads from that pipe, it reads from the actual pages, so there is no copying of data involved. But this allows you to do something very nasty. You have some anonymous page in the parent, you store some boring data in it, and then you call fork. What the child can do is take such a read-only reference on that page and then simply unmap the memory. What we're left with is a page that's only mapped once, but with an additional reference from another process. As soon as our parent stores some secret data into that area, our child can observe it by just reading from that pipe. That's not good, because our child process might have fewer privileges than our parent process, so a child could read secret data of its parent process.
Let's take a look at the history of the whole thing. It was reported in May 2020 and resulted in a CVE. It was fixed by a commit that broke a lot of other stuff, and that commit was later reverted. Then there was another fix, essentially the one called the COW simplification, that fixed the security issue for ordinary anonymous memory but didn't consider, for example, do_swap_page(), transparent huge pages and hugetlb. So on the one hand, after reverting that problematic commit, we actually had something that was semi-broken and semi-secure. The interesting thing is that the COW simplification resulted in some very subtle undesired side effects, and we're going to talk about them. Long story short, do_swap_page() and transparent huge pages are now fixed upstream; in the upcoming kernel release they will be fixed such that there is no security issue. However, as I like to put it, most of the stuff is now consistently broken.
It's not that one thing is broken differently from another; everything is consistently broken, because we have these undesired side effects across almost all anonymous memory. As for hugetlb, it is still affected upstream by the security issue, and it's not that easy to fix with the current approach, but in general we don't care too much about hugetlb here, I would say, because using hugetlb to share memory between a privileged parent process and an unprivileged child process is rather rare. So that's certainly something to look at in the future, but it's not an immediate issue, I would say.
So let's take a look at what the upstream fix, the COW simplification, does. Essentially we say: if the page count is not one, we're not going to reuse the page. That means we no longer look at the map count, at how often this page is mapped. Instead we say: if there are any references that we don't expect, meaning more than one, we're not going to reuse the page. So all we care about now to make the decision is the reference count. This has a nice property: we cannot have the security issue anymore, because when we reuse a page there can only be one reference, the one from the page table we're looking at right now, and no others. But obviously that also means that whenever we run into this code and detect a page count bigger than one, we have to copy, and that's the real issue, because we might copy too often.
Here's the example of the security flaw and how it's fixed. If we just look at the refcount, then at the point in time where our parent wants to store the secret data into that shared page, the one that has been GUP'ed in the child, we now have a reference count of two.
The copy-on-write handler says: there is an additional reference, I'm not going to reuse that page. So we copy, and our child is not able to observe the memory modifications of the parent. That is the nice part.
Let's talk about the undesired side effect. This is, let's call it not quite an artificial example, but it's hard to come up with a pin_user_pages user that you can just write in a C program. So assume you take a read-only reference in the parent, for example via RDMA, VFIO, vDPA or something else. What can happen is that you take that read-only reference on a page that's currently marked as shared, or possibly shared, because it's mapped read-only. You then have a page that's mapped read-only into your page table, and after pin_user_pages its reference count is bigger than one. Whenever we now simply write to our memory, the copy-on-write handler says: this page is mapped read-only and has a reference count bigger than one, so I'm going to copy. What happened here is that whatever we write to our page from our process is not observable by our pin_user_pages user, which means that we have a disconnect between, for example, our secondary MMU and our primary MMU, because in this scenario we copied although we shouldn't have.
So in general, when we look at that approach, we have two issues. On the one hand, we might copy although there is no need to copy. That's simply unnecessary; it doesn't do any harm, but it might result in some performance degradation. On the other hand, as we just saw, we might copy although we really must not copy, because it results in an inconsistency, and that is what I refer to as a wrong COW.
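Both effects of the refcount-only rule can be replayed in a small toy model. The following Python sketch is only an illustration under simplifying assumptions: `Page` and `write_fault_refcount` are invented names, a single counter stands in for the page count, and a "GUP reference" is modeled as nothing more than an extra count plus a Python reference to the same object.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy page: content plus a reference count (hypothetical model)."""
    data: str
    refcount: int = 1  # page-table mappings plus any GUP references


def write_fault_refcount(page: Page, new_data: str) -> Page:
    """Write fault on a read-only mapped page: reuse only if refcount == 1."""
    if page.refcount != 1:
        page.refcount -= 1          # our mapping moves over to the copy
        return Page(data=new_data)  # fresh page with a refcount of 1
    page.data = new_data
    return page


# Scenario 1: the vmsplice-style leak is closed.
page = Page("boring")        # parent's anonymous page
page.refcount += 1           # fork: the child maps it as well
page.refcount += 1           # child takes a read-only GUP reference
child_view = page            # what the child's pipe reads from
page.refcount -= 1           # child unmaps the page; refcount is now 2
parent_page = write_fault_refcount(page, "secret")
leak_closed = child_view.data == "boring"   # child cannot see "secret"

# Scenario 2: the wrong COW.
page2 = Page("v1")           # mapped read-only in a single process
page2.refcount += 1          # read-only pin, e.g. on behalf of a device
pinned_view = page2          # what the pin holder keeps looking at
process_page = write_fault_refcount(page2, "v2")
wrong_cow = pinned_view.data != process_page.data  # pin holder misses the write
```

The same check that protects the parent in the first scenario disconnects the pin holder from the process in the second one, which is the trade-off the talk calls a wrong COW.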
The issue here is that GUP and the memory management code lose synchronicity. In the previous example, the GUP user will miss the write access from the parent process. The nasty thing is that this happens essentially whenever our page gets mapped read-only into the page table, and unfortunately there are various ways to get a page mapped read-only into the page table. One example: you swap out a page, or unmap the page, and then do a read access; the page gets refaulted from the swap cache read-only and you run into exactly that issue. But we also learned that, for example, NUMA hinting code can, for whatever reason, lose the write permission on a mapped page. So suddenly your NUMA hinting code has dropped the write permission of the PTE, the page is mapped read-only, and as soon as somebody writes to the page you create a disconnect between GUP and the pages that are mapped into the page table.
So the approach that I'm upstreaming right now, and I want to raise awareness because it would be pretty good to have more people review that code, is something called PageAnonExclusive, paired with something called unsharing, and paired with the refcount logic. Our copy-on-write handler will still rely on the refcount, but we tweak it in a way that it doesn't end up in these wrong COWs, so to say, where we really shouldn't copy. At the same time we try to mitigate, a little, the performance issues we've been seeing.
So PageAnonExclusive is a page flag, and what it expresses is: this anonymous page that I'm looking at, is it exclusive to a single process or might it be shared? If it might be shared, it could be exclusive or it could be not exclusive, we don't know. But once we see that flag set, it means this really is an exclusive page.
Then we have a set of rules for how to apply this whole thing. For example, whenever we map a page writable into a page table, then because it's writable it cannot be shared, by definition, so it has to be exclusive. What we're not going to allow is for our copy-on-write handler to replace a page that's exclusive: if our copy-on-write handler stumbles over a page that's exclusive, we simply reuse it. We don't do any kind of checks on the refcount, because we know this page is exclusive and hasn't been shared with anybody else. On the other hand, in our get_user_pages code, when we pin a page that's anonymous, we only allow pinning it if it's exclusive. If it's not exclusive, we most probably have to trigger unsharing, and I'm going to talk about that next. And the last rule, so to say, is that whenever we have a page that's anonymous and exclusive and we might want to share it, for example because we fork or because KSM is coming around, we disallow that if the page may be pinned.
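The rules above can be written down as a toy model. This Python sketch is a simplification and not the kernel implementation: the `exclusive` field stands in for PageAnonExclusive, `maybe_pinned` stands in for "the page may be pinned", and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AnonPage:
    """Toy anonymous page (hypothetical model, not a kernel structure)."""
    data: str
    refcount: int = 1
    exclusive: bool = True       # models the PageAnonExclusive flag
    maybe_pinned: bool = False   # models "this page may be pinned"


def write_fault(page: AnonPage, new_data: str) -> AnonPage:
    """COW handler: an exclusive page is reused without any refcount check."""
    if page.exclusive:
        page.data = new_data
        return page
    if page.refcount == 1:
        # Possibly shared, but nobody else holds a reference: reuse it.
        page.exclusive = True
        page.data = new_data
        return page
    page.refcount -= 1
    return AnonPage(new_data)    # the writable copy is exclusive by definition


def pin(page: AnonPage) -> None:
    """GUP pin: only exclusive anonymous pages may be pinned."""
    if not page.exclusive:
        raise RuntimeError("page is not exclusive; unsharing is needed first")
    page.refcount += 1
    page.maybe_pinned = True


def fork_page(page: AnonPage) -> AnonPage:
    """fork(): return the page the child ends up with."""
    if page.exclusive and page.maybe_pinned:
        return AnonPage(page.data)   # never share a pinned page: child copies
    page.exclusive = False           # from now on the page is possibly shared
    page.refcount += 1
    return page


# Read-only pin before fork: the parent keeps its page, the child gets a copy.
p = AnonPage("v1")
pin(p)
child = fork_page(p)
parent = write_fault(p, "v2")
```

In this model a pinned page stays with the process that pinned it across fork, so the pin holder keeps seeing the parent's writes.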
So that means as soon as the page may be pinned, we're by no means going to share it with anybody, and the other way around, we only allow pinning a page if it's exclusive. That gives us some quite nice consistency guarantees. On the one hand, it defines what a pin on an anonymous page means, for example that the page cannot be shared. On the other hand, when we look at an ordinary process that doesn't fork, all pages are exclusive, and our page fault handler, our copy-on-write handler, will be extremely fast, because we don't have to check anything else or take the page lock or anything like that. We just say: this page is exclusive, I'm going to reuse it.
So what happens if we want to pin an anonymous page read-write, so we want to take a write reference? It's fairly easy. If the page is already mapped writable in our page table, we can just go ahead and pin it, because writable implies that it has to be exclusive. If the page is mapped read-only, it's the same thing as we always used to do: taking a write pin on a read-only mapped page means that we have to trigger a write fault. So we go into the write fault handler, it eventually triggers a copy-on-write, and we're left with a page that's mapped writable that we can just pin.
Now the interesting part is when we want to take a read-only pin on a page. If it's mapped writable, again that's not an issue: we know that if it's writable it's exclusive, so we can just pin it. If it's mapped read-only and it's exclusive, that's also not an issue, because our rules say that we only pin exclusive pages, so we can go ahead. But in the special case that our page is mapped read-only and is not exclusive, we have to trigger something called unsharing. Unsharing is essentially a copy-on-write but without mapping the page writable, so it looks like a copy on read. Essentially we want to make sure that, whatever shared page was mapped there, afterwards we have an exclusive page. It's very similar to our write fault handling. For example, we might detect that this page has a single reference, so we can just reuse it: we set the page exclusive and we're done. But of course if there are more references, we don't know whether the page has some other references on it that we're not able to deal with, so we have to copy. And when we copy, we map the page read-only into the page table, because it's unsharing and not a write fault; it's essentially a read fault that triggers unsharing.
In the example at the bottom of the slide, we again have the scenario with two read-only mappings, but this time we have a read access on one of them, triggered by get_user_pages. What we do is essentially the same as in the copy-on-write example, but here we leave the page mapped read-only and not read-write. If you look at some simple examples, this is the example that we had before, where the current upstream logic was messing up and creating this disconnect.
David: Did you want to say something?
Audience member: Yeah, could you just go back?
David: Sure.
Audience member: Oh yeah, so when we talked about this earlier I strongly suggested not saying copy on read, and I still do, because that's not what it is. It's a cute name that is completely wrong, and so we should say something else. Your original description that doesn't include those words is perfect.
David: So the code doesn't include that. I'll state that I found some other operating systems that use the term copy on read for something similar. But yeah, the interesting thing is that we talk about copy-on-write and copy on read, but it's actually not a copy on read, it's GUP-triggered unsharing. So it's unsharing, that's the official word.
Audience member: Yeah.
Audience member: Is it called copy on access, or?
David: Not really access, because it's not the CPU accessing the page that triggers it.
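The pinning cases and the unsharing step described in this part of the talk can be sketched as a toy model. The Python below is an illustration only: `AnonPage`, `Pte`, `make_exclusive` and `pin` are hypothetical names, and the write fault and the unshare fault share one helper here, differing only in whether the mapping ends up writable.

```python
from dataclasses import dataclass


@dataclass
class AnonPage:
    """Toy anonymous page (hypothetical model)."""
    data: str
    refcount: int = 1
    exclusive: bool = False      # models the PageAnonExclusive flag


@dataclass
class Pte:
    """Toy page table entry pointing at a page."""
    page: AnonPage
    writable: bool = False


def make_exclusive(pte: Pte) -> None:
    """Reuse the page if we hold the only reference, otherwise copy it."""
    page = pte.page
    if page.exclusive:
        return
    if page.refcount == 1:
        page.exclusive = True    # single reference: just mark it exclusive
        return
    page.refcount -= 1           # our mapping moves over to the copy
    pte.page = AnonPage(page.data, refcount=1, exclusive=True)


def pin(pte: Pte, write: bool) -> AnonPage:
    """Take a pin, following the cases described in the talk."""
    if write and not pte.writable:
        make_exclusive(pte)      # write fault: copy-on-write if needed ...
        pte.writable = True      # ... and the result is mapped writable
    elif not pte.page.exclusive:
        make_exclusive(pte)      # unsharing: the mapping stays read-only
    pte.page.refcount += 1       # the pin itself
    return pte.page


# Read-only pin after the child quit: a single reference is left, so the
# page is simply marked exclusive and pinned.
pte_a = Pte(AnonPage("data", refcount=1))
page_a = pte_a.page
pinned_a = pin(pte_a, write=False)

# Read-only pin while the child still maps the page: unsharing has to copy.
shared = AnonPage("data", refcount=2)
pte_b = Pte(shared)
pinned_b = pin(pte_b, write=False)
```

Either way the pin ends up on an exclusive page, and in the read-only case the mapping is left read-only, which is the difference from an ordinary write fault.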
Unsharing is really only triggered from get_user_pages code that wants to pin a page, and that's why we refer to it as unsharing, as a generic term. We might have other users of unsharing in the future; eventually, for example, KSM code might want to trigger it in some scenarios. So get_user_pages is just the most prominent example. But the big difference here is, and I agree that copy on read is a bad name for it, that the name somehow makes you assume this is something that could also be triggered by the CPU on read access, and that's not what happens. It only happens when GUP wants to take a read-only reference.
So let's look at our example. Previously, in the pin_user_pages example, the memset operation would have triggered a copy-on-write because the page was shared, and we would have messed up. In this example, as soon as we do the pin_user_pages, essentially a read-only pin, we trigger unsharing, and after unsharing we can write to the page without any issues and we're done. Unsharing in this example, because there are no other references since our child quit, simply means that we set the page exclusive; GUP code can go ahead and pin that exclusive page and we're done.
If we change that example a little, so that instead of the child quitting the child doesn't quit, we still have a page that has more references than expected. As soon as we trigger unsharing, we now have to copy the page; our parent process will then work on the copy, and we also don't have any inconsistency.
And last but not least, this is an example where we take a read-only pin before fork, and it also works as expected. Starting at the very top, we have a page, it has a reference count of one, and it's exclusive. We write to it; that doesn't change anything. We then take a read-only pin in that example.
The page is exclusive, so we can just go ahead and pin it read-only, and the reference count changes. The interesting part now is fork: fork will detect that this page may be pinned, and because the page may be pinned, we're not going to share it. We're not going to turn the exclusive page into a shared page. So we don't share it with our child; our child immediately receives a copy, our parent can just continue to work on its page, and everybody's happy. Any questions regarding that? Perfect.
Of course, it's not quite that easy. I'm just going to mention some tricky bits. Transparent huge pages are nasty. The issue with transparent huge pages is that as soon as you PTE-map a transparent huge page, you need that exclusive information per subpage, because our process could do some nasty things. It could remap parts of a transparent huge page to a different page table. It could set MADV_DONTFORK (I don't recall exactly what it's called) so that a certain part of your VMA is not included in a fork. So you can come up with quite some nasty conditions where only parts of a transparent huge page are mapped into one process and other parts into another. As soon as you combine that with pinning, stuff gets a little bit nasty. So we need that information per subpage, and we only need it per subpage once we no longer map the transparent huge page natively but PTE-map it.
Another thing that is a little bit subtle is when we do some kind of temporary unmap. A temporary unmap is, for example, when we want to swap out a page: we insert a swap PTE into our page table. The same goes for migration: if we want to migrate, we unmap the page and insert a migration entry. But what happens if we're in that condition and we call fork?
Well, if we're in that condition and we call fork, our fork code will simply duplicate the swap entries or the migration entries. If we had a pin on these pages before we replaced their mappings by migration or swap entries, fork would simply copy the entries, and we would be in a scenario where we have a pin on a page that's no longer exclusive but has been shared. So we have to take care of that, and essentially the rule is: we only unmap something from the page table if we can clear the exclusive bit, and we can only clear the exclusive bit if there are no pins on the page. That means that if we succeed in the clearing operation, there cannot be any pins on the page, so fork code can go ahead and share the page and we'll be fine.
Something else that is really nasty, which I learned, is get_user_pages_fast (GUP-fast), which basically takes no locks. It just walks the page tables and pins the page without looking at the page lock or anything else, or rather, and this is what it should actually say, without taking the page table lock. The thing is, the page table lock is our primary mechanism to synchronize clearing and testing the exclusive bit, but GUP-fast code doesn't care; it doesn't take the page table lock. So how could we possibly synchronize with that? It could be that we're just about to share the page during fork, but some concurrent code takes a reference on the page, and we could again end up in a scenario where we have a pin on a page that's no longer exclusive.
So we have to do some tricks. For example, we have to make sure that whenever we temporarily unmap a page, we clear the PTE and flush the TLB, because the TLB flush will then synchronize against GUP-fast code, and we can be sure that the scenario I described cannot happen. The other thing is that during fork we rely on a special sequence counter that tells GUP-fast code: it's nice that you took a pin on the page, but there was concurrent fork activity, so please drop that reference and try again. With both mechanisms we can make sure that we cannot end up with a page that's pinned and shared, because that again would be a security issue, of course.
The other thing is that when we do a temporary unmap, like for migration, and insert a migration entry, we are required to clear the PageAnonExclusive bit to make it work, but of course we want to preserve the information that the page was exclusive. We do that via the non-swap PTE: we have special migration entries that we put into the page table. For writable migration entries it's fairly simple, because writable on an anonymous page implies exclusive, so we already have that information. But when we have a read-only mapped page that was exclusive, we have to remember that in the migration entry. What I came up with is simply yet another migration entry type. It's called readable-exclusive; not the best name, not my best work, but it gets the job done. So essentially you have writable migration entries, readable migration entries and readable-exclusive migration entries, and that allows you to handle migration accordingly and to preserve the exclusivity information even when you migrate the page.
The other thing is: how can we make sure that we don't lose that information when we swap something out?
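Before the talk moves on to swap, the migration entry bookkeeping just described can be sketched in a few lines. This Python fragment is a toy encoding, not the kernel's representation: the `MigrationEntry` enum and both helper functions are names invented for this illustration.

```python
from enum import Enum


class MigrationEntry(Enum):
    """The three migration entry kinds mentioned in the talk (toy encoding)."""
    WRITABLE = "writable"
    READABLE = "readable"
    READABLE_EXCLUSIVE = "readable-exclusive"


def make_migration_entry(writable: bool, exclusive: bool) -> MigrationEntry:
    """Pick the entry that replaces a PTE while the page is being migrated."""
    if writable:
        # Writable on an anonymous page implies exclusive, so no extra
        # information has to be stored.
        return MigrationEntry.WRITABLE
    if exclusive:
        return MigrationEntry.READABLE_EXCLUSIVE
    return MigrationEntry.READABLE


def restore_mapping(entry: MigrationEntry) -> tuple[bool, bool]:
    """Return (writable, exclusive) for the PTE installed after migration."""
    writable = entry is MigrationEntry.WRITABLE
    exclusive = entry is not MigrationEntry.READABLE
    return writable, exclusive


# A read-only mapped exclusive page keeps its exclusivity across migration.
entry = make_migration_entry(writable=False, exclusive=True)
restored = restore_mapping(entry)
```

The point of the third entry kind is the round trip: without it, a read-only mapped exclusive page would come back from migration as possibly shared.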
Preserving it across swap is actually only required to get FOLL_GET right, not FOLL_PIN, and the bad news is that we need an architecture-specific swap PTE flag for it. Just like, for example, for soft-dirty tracking, we have to remember that the swap entry has certain attributes, and in this case the attribute is that the swap PTE is exclusive. So far I have only converted the most prominent 64-bit architectures, for example arm64, s390x, PowerPC and of course our beloved x86-64. The other ones can be done, but let's get this upstream first, and then we can think about the others, and especially about which 32-bit architectures we actually care about in that sense.
David: Any questions regarding that, regarding these details? Yeah?
Audience member: I do, but we're right at the top of the hour, so the session's about over.
David: Right. So what I'm going to mention here, and this is going to be the last slide, is that it's not optimal, and that might or might not be an issue. The thing is, as long as our process doesn't call fork, all anonymous pages are exclusive and we live a happy life, because we reuse an exclusive page in the copy-on-write handler immediately. But as soon as we fork, we're no longer optimal, because we again fall back to the check: is the page count equal to one or not? And we can end up in scenarios where the page count isn't one quite easily, for example with speculative references and things like that, but most prominently when we have a PTE-mapped THP. That will always have a page count bigger than one, so we'll always have to copy instead of reuse, and that's not optimal. The question here is: do we care? Is there a workload that cares whether we do unnecessary copies or not? With the work that I presented, essentially everything that uses FOLL_PIN, so everything that takes a proper pin on a page, is completely fine.
What's not completely fine yet, and this is not something that was broken in between, it never worked reliably, is taking a writable FOLL_GET reference on a page, so a writable reference that is not a pin. That only works reliably as long as you don't fork and as long as your architecture supports the special swap PTE flag to preserve the exclusive information. So the motivation here is: we have to convert O_DIRECT to FOLL_PIN, and then everybody will be happy, and eventually we can remove from our man page of open the statement that O_DIRECT should never be used in combination with fork because it's evil. Once we fix that, it should actually work as reliably as it can. I'll leave you with that. We can talk in the hallway if there are any questions. Thanks a lot.