I think everyone who's going to be in the room is in the room, and any stragglers are late, so let's get started. Okay, sounds good. So, I'm Khalid Aziz and I have been working with Matthew on a concept to share page tables across processes. Please feel free to ask questions as they come up while we are talking. With that, I'll hand over to Matthew to introduce the concept. I drew this little diagram last night to illustrate the difference between what we already have and what we intend to have going forward: what's the difference between what we do today and what we do with mshare? On the left there you can see a pair of single-threaded processes. There are two mm_structs, and each of them has a task_struct which points to the mm_struct, straightforward. In the middle we have a multi-threaded process. We all understand what a multi-threaded process is, but at this point I would say: we've got three task_structs, each of them pointing into a different part of the process, because we're all executing in different parts of the multi-threaded process's address space. So the new stuff is on the right. We have two independent mm_structs. Each happens to be single-threaded; they don't have to be, they could be multi-threaded. But they have this mshare region, and this mshare region points off into a third mm_struct which has no task_structs pointing to it. It's not a thread; it has no threads. Okay. So let's look at what happens. If you could start clicking whenever it seems appropriate. So the first single-threaded process maps a page using MAP_SHARED. Hey, so we've got a little red box. Now any modification it makes is going to be visible to anyone else, like the second process, which also calls mmap with MAP_SHARED. Great. Now either of them can modify it and they'll both see the other's modifications. Right. And then the third process, the multi-threaded process, also maps it MAP_SHARED.
And the difference here is that if the first process calls mprotect to make it read-only, it will still see the modifications from the other processes. But if the multi-threaded process calls mprotect, none of its three threads will be able to write to it, right? So the independent processes each decide for themselves what their protections are, but in the multi-threaded process, one thread can decide for all of the others what the protections are, because they're all sharing the same address space. Okay. So now we're going to have the mshare'ing tasks call mmap. Click. And it appears in everybody's address space all at once, because these are literally shared page tables. And so if any of them calls mprotect, again, that happens to all of them, just like with the multi-threaded process, except that in the purple areas things aren't shared, but in the green area everything is shared. And I apologize if anyone is colorblind; by the green area, I mean the stuff that's labeled mshare region. So we're not talking about processes that don't know each other. We are talking about processes which have decided to cooperate with each other. We're talking about maybe a web server that has decided it's going to have some shared chunks of memory and some not-shared chunks of memory. Or we're talking about a database, maybe, that has some very special and unusual requirements. But it's not just for our employer, right? I think this is a generally interesting thing to do. Khalid, do you want to take it away and talk a bit more about the implementation? Of course. Okay, so as Matthew talked about, threads already share page tables. Why can't cooperating processes share page tables as well?
One of the big benefits we see is that when you have multiple processes sharing the same pages, especially when you have thousands of processes, we are wasting a lot of space with each process keeping its own copy of the page table entries. So to solve that problem, what we are proposing is the mshare mechanism, which is an opt-in mechanism, because the processes have to have a certain level of trust between them. A process says: I'm creating an mshare region and I want to be able to share it with everyone else. And there will be some authentication mechanism, essentially permissions, so whichever other process has permission to share that region will be able to share it. The process that's creating the mshare region informs the kernel, then it starts mapping objects in that region. And any other process that has permission to access that mshare region can then map that region into its own address space, get access to all of the objects that are currently mapped in there, and then be able to read and write those objects. I have sent out a couple of patch series. We started out with an API that was a combination of an in-memory file system and system calls, and based upon the feedback, I'm now working on an API that looks more like this. The implementation adds a new in-memory file system, msharefs. You mount this file system, and then you can create files in it. Creating a file in that file system creates an mshare region, and the name of the file is what you refer to the mshare region by, so that other processes can attach to that mshare region. So the process creates a new file in there, and then it can mmap the FD it gets back from the file system.
When it does the mmap, that's when we know what the extent of that region is, the starting address and the size, and at that point the region is defined. Now another process can come in, open that same file in msharefs, and then mmap that FD in its address space. Now all of the objects that were in the mshare region become visible to this new process, and when the process is done with its interest in the mshare region it can simply call unlink on the file; when the last reference to the mshare region is removed, the file is removed. I have started with an initial implementation where the focus is on getting the core functionality working, and then expanding it as the need arises. One of the questions is: what is the minimum size of the region you want to share? I have started out with PGD size, which is fairly large, but it keeps things simple for now; we can potentially look at sharing at PMD size as well. What that implies is that a process that wants to map an mshare region into its address space has to know what size and alignment requirements apply to this specific region. So when you mount msharefs it will populate a file, mshare_info, that provides that information. You just open it, read it, and you have the size and alignment requirements. Once the process has created a file, and in effect created a new mshare region, what happens is that the open causes a new mm_struct to be allocated. This mm_struct, as Matthew pointed out, is not assigned to a task; it stands on its own. This mm_struct is primarily used to hold all of the VMAs that represent the mshare region, and it also holds all of the page tables for that mshare region.
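The userspace flow just described might look roughly like this. The mount point, file name, and the exact mshare_info contents below are assumptions for illustration only; the patch series was still evolving, so treat this as a sketch of the proposed API, not a stable interface.

```c
/* Hypothetical fragment, not a complete program: the msharefs flow
 * described in the talk. Paths and names are illustrative. */

/* 1. Learn the size/alignment requirement for regions. */
int info = open("/mnt/msharefs/mshare_info", O_RDONLY);
read(info, buf, sizeof(buf));           /* e.g. an alignment in bytes */

/* 2. Creator: making a new file creates a new mshare region. */
int fd = open("/mnt/msharefs/db_region", O_RDWR | O_CREAT, 0600);
void *base = mmap(aligned_addr, region_size, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);   /* defines start + extent     */

/* 3. Any permitted process: open the same name and map it. */
int fd2 = open("/mnt/msharefs/db_region", O_RDWR);
void *base2 = mmap(aligned_addr, region_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd2, 0); /* sees every mapped object   */

/* 4. Done: unlink drops the name; the last reference removes it. */
unlink("/mnt/msharefs/db_region");
```

The point being that a name in the file system, rather than a passed-around descriptor, is the handle other processes use to attach.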
Then, when the first mmap happens, if the creating process already has objects mapped in that region, we copy all of the VMAs from the creating process into this new mm_struct, and we set a new flag, VM_SHARED_PT, on each of the VMAs that correspond to the mshare region. And currently I also change the vm_mm pointer of these VMAs to point to this new mm_struct. I have a picture; let me bring that up. So you have a process. It has its own mm_struct. It has a bunch of VMAs in there. Each VMA has its vm_mm pointing back to the process's own mm_struct, except for the VMA that maps the mshare region; that VMA has its vm_mm pointing to the mm_struct for the mshare region, as well as the VM_SHARED_PT flag set on it. When a page fault occurs, we look up the VMA where the page fault occurred, and if the VM_SHARED_PT flag is set on that VMA, we know it's an mshare region. So we go to the mshare mm_struct instead and continue the page fault handling from there. As a result we bring all the page tables for the shared region into the mshare mm_struct, and that's what all of these VMAs operate on. Okay, so I have a basically working implementation, but there are still a lot of loose ends that should be tied up, and questions that we are trying to answer as we go through this process. Granularity I already said something about. Looking at the API as we have it today, one of the questions is: does it do the job? Is it sufficient? Is it reasonable and maintainable? Then, when we look at mmapping these mshare regions, currently our intent is to allow mmapping only the entire region, but should we be looking at allowing mmap of partial regions? And the same thing on munmap: when a process does an munmap, do we force it to unmap the entire region, or can it leave some parts still mapped in? Then also, how would a process map in new objects?
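The fault redirection described above can be sketched in kernel-style pseudocode. This is only an illustration of the idea as explained in the talk, not the actual patch; the helper name and exact call signatures are simplified.

```c
/* Pseudocode sketch: redirect fault handling into the standalone
 * mshare mm_struct when the faulting VMA has VM_SHARED_PT set, so the
 * PTEs are instantiated once, in the shared page tables. */
static vm_fault_t mshare_aware_fault(struct vm_area_struct *vma,
                                     unsigned long addr,
                                     unsigned int flags)
{
    if (vma->vm_flags & VM_SHARED_PT) {
        /* vm_mm already points at the mshare mm_struct (the mm with
         * no tasks); find the corresponding VMA there and continue. */
        struct mm_struct *shared_mm = vma->vm_mm;
        struct vm_area_struct *shared_vma = find_vma(shared_mm, addr);

        return handle_mm_fault(shared_vma, addr, flags, NULL);
    }
    /* Ordinary, private fault path. */
    return handle_mm_fault(vma, addr, flags, NULL);
}
```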
So we need to look into how to support mremap: the process that created the original mshare region potentially had objects mapped in there, but another process that's now sharing that region could potentially unmap an existing object and map in a new one, so we should possibly support that. There's the question of whether we allow these mshare regions to stack on top of each other, which makes it very interesting. And then we have the potential interaction of these regions with userfaultfd. How do we support that? Or do we say no, you can't use userfaultfd with mshare regions? Any questions at this point? Yes. Can I get a microphone? Okay, I've got a microphone. So I still haven't wrapped my head around this thing. It sounds quite scary, because we have that natural model where one process has one mm_struct. So who is in charge of that other address space? How do we do accounting? If you mmap with MAP_FIXED into an area that conflicts, what happens? Does it happen to everybody who is sharing that mm? There are very, very many questions and I don't know where to start. That's entirely fair. I mean, I think to a large extent it's the same as if they were threaded, like fully threaded. I don't think there's been partially threaded before, right? These two processes are somewhat threaded with each other. They share that chunk of their address space, and so for that chunk of their address space they behave as if they were threads of the same process, but for the rest of it they're independent processes. Yeah, right. I'm not really sure that's answering the questions. I think it answers your MAP_FIXED question. Yes, so let's say that you mmap with MAP_FIXED into a conflicting area. Does that really mean that you are MAP_FIXED-mapping for everybody sharing that thing? Yes.
Okay, I thought that CLONE_VM without CLONE_THREAD, or CLONE_SIGHAND, or whatever that combination is where you essentially share your mm without sharing a common understanding of signals, was the biggest sin. This seems to be going beyond that. Yeah. Right. So we do have two disjoint processes that are sharing parts of their address space, which is why there are more questions, and one of the questions on the next slide is security. What kind of security issues does this raise? You are right that there are a number of these issues that we do have to address as we go along. So in my mind the big questions are: is the concept useful, is the way we are thinking about it reasonable, and can we address the potential issues that come up? Regarding the API: is a file system a must? Maybe a system call that returns a file descriptor, and then cooperating processes with the same rights can share it? That was my original design. And the original API did not survive first contact with the enemy, I mean, the user. They did not enjoy the SCM_RIGHTS approach. They wanted to have a name in a file system. And having a name in a file system gives you a little bit more; it gives you enumerability of which mshare regions actually exist, which otherwise, with just FDs, I mean, you can use fuser I guess, but there's a lot of scanning to do to find out which shared regions exist. Right, and with an FD-based API, one of the issues that was pointed out was that it becomes a client-server model: we have to have a server that holds that FD and passes it to clients using SCM_RIGHTS, and now we have built a bottleneck. I just wanted to add, about the API, that I knew there was evil underneath, but the API struck me as a really elegant thing. It looked like a version 2, after you'd already considered the usual approaches, so I was shocked and refreshed to see the API. So if you do it, my vote is for that one. Thank you. Dan's got a question, can someone get the microphone to him?
I see some questions in the chat. People are asking: does it work with get_user_pages_fast? How do you do the accounting for pinned pages with this? So right now I'm focusing on the core functionality, and get_user_pages_fast is on my list of things to address, but yes, the goal is to make it work with that as well. So in general I think this is a pretty interesting concept, but I agree that it's scary, so I was wondering if we could come up with some kind of whitelist of what you can actually do with something like that. Meaning, for example, if you want to do an mlock, no, that's not going to happen; if you want to map private memory, that's not going to happen, until we've fully explored the problem space. So you would actually start with a whitelist that is fairly small but gets the job done, and as we go we unlock features, once we know what we're actually dealing with. And that might address one part of the issue: page pinning, we just don't support it there, for example. Maybe that could be one approach to move forward, because it is scary, I mean, I'm not going to lie. Our customer's requirement is with DAX. They didn't want to have terabytes of memory, but they were mmapping it and using more terabytes of memory for the page tables than they were getting in actual storage, because if you do the calculations with about 10,000 processes, that's where you end up: about twice as many bytes used for page tables as for actual storage. So that wasn't great, and I totally understood why. The original request was: hey, give DAX the same functionality as hugetlbfs. And we all got a bit twitchy, because everybody hates hugetlbfs, and so it was like, okay, what can we do instead, what can we do to make this not awful? And this is what we came up with: something differently awful, now, I don't know. I think this is kind of interesting, so we're interested in the concept of: we have an mm_struct with no threads attached to it, but we do have a file descriptor for it. So now we have a file
descriptor for an mm_struct, and what can we do with that concept? Like, could we make system() faster or better? Because now you could implement the system() libc call by saying, okay, create a new mm_struct and then start adding things to it, and then say, okay, now breathe life into it and have it go off and be that new thing, rather than forking and copying all your VMAs and copying all your file descriptors. If we had a way to say, okay, let's create a process over here, it doesn't have any life, but we'll inject file descriptors into it and do various things to it and then set it going, I think that could be a really interesting concept. Sounds to me like Frankenstein's monster. It sounds to you like what? Like Frankenstein's monster. Well, yeah, I mean, you kind of sew it together, but clone is like that, right? You sew a process together from various different things. I think Frankenstein would have loved us. Right? A misunderstood genius, just like us. Maybe you should come up with a new model and a name for this, instead of just limiting it to "there's a new mechanism, isn't it fascinating". Maybe go one step further, since you can't just have random stuff floating around without a model, really, because then it's too hard to think about security. But if you said, well, this is a new thing, let's call it, I don't know, a lightweight process or a Frankenstein process, come up with something, and then here are the rules for that, and they just happen to match exactly what you're doing here, that might be more acceptable even though it seems bolder. I think it would help. So my question would be, as I said, I cannot really comprehend the consequences of this, because this is just beyond any imagination. But you said that you really hate hugetlbfs; can you explain why this cannot be done through a mapping, so that you actually have something that is really specific to that address range, for example? Why can't you really share on a per-mapping basis, so
essentially just abstract what hugetlbfs is doing in some more shareable way? I mean, that's ugly, right, but the level of ugliness is somewhere else than an unbounded number of mm_structs in something that barely resembles a process. Okay, so we are looking for different semantics. When you start to talk about sharing page tables, you come back to the different semantics that you get, for example with mprotect, and the customer does in fact want to do mprotect, and it wants the semantics I described: that if you mprotect it in one task, it becomes mprotected for all tasks that are sharing it. That's a feature, not a bug. They want to literally share the page tables. They want it to be as if it were threaded for that chunk of their address space. They just don't want to use threads, because mmap_sem sucks, and obviously we are trying to fix that in a different way, but for now, mmap_sem sucks. So, I don't know. We definitely don't want to make it so that if you call mmap with MAP_SHARED you magically get these new semantics, because that would break existing programs that are not expecting somebody else's mprotect to suddenly mean they can't read or write this page. That's got to be opt-in semantics, which means at least a new flag to mmap, if not a new syscall. And so I thought it was cleaner to be explicit about "we are going to share this region of the address space" than to say "if you use this magic flag to mmap, then the chunk of your address space which contains this mapping is now shared with everybody else". Exactly, this makes it very explicit that the process is choosing to do that, as opposed to it being a side effect of a system call that it was already making before. So it becomes a deliberate decision to opt into this mechanism. So how many of these shared regions are supported per process? Unlimited, unbounded, or is it just the one?
There's no particular limit. I mean, you can create as many files in msharefs as you want. That wouldn't typically be what I would expect to see a program do. What I would expect a program to do is say: I want this very large chunk of my address space to be shared, and then it can map several smaller things in it, and they can be a mixture of MAP_PRIVATE, MAP_SHARED, files, and so on. You can put all kinds of things in there, but you have to choose to put them in. If you just call mmap and you don't specify anything, the private bits of your address space don't somehow invisibly end up in the shared region, because that would be a horribly stupid thing. The unbounded question is kind of why I thought of this: would it be better to just have one possible mshare region versus an unlimited number, and maybe not have the VMA pointing to it but some other mechanism, a special VMA of some type that just blocks the range off? I'm just thinking of what the VMA's mm_struct pointer is used for and how entangled into the system that is right now, how much will change or will need to change to support this. Right, so just to amplify what you're talking about, here is an example: we tried to do a write into a read-only VMA and now we need to send a signal, but we stop at the mm, which is shared, and now which process do we actually send the SIGBUS to? That kind of thing. Yeah, and who owns the thing. And also find_vma rather than find_mm. Jason has some comments. Jason, would you like to go live and explain what you're saying, or should I just read it? Can you hear me? Yes, yes, we can. Oh wow, thank you. I was just wondering, listening to you talk about the use case for DAX, I was thinking maybe there's some way to make the DAX files hold the page tables, basically. You're shaking your head; I can see you on video. Okay, so that's kind of what we're considering doing for hugetlbfs. Right now hugetlbfs goes around and looks at other processes to see if
anybody else has it mapped and shares their page tables with them, rather than storing the page table pointers in the inode. I meant to suggest this to Mike, but I only thought of it this morning when I was out for my run and haven't had the chance to chat with Mike yet, so this is probably quite a surprise to him. But it seems to have sort of a conceptual appeal: if you could say, when I instantiate the VMA and go to install all the pages, maybe instead of installing a bunch of pages I just install one thing, like a P4D or something like that, something big, and it's just owned by the VMA or the inode or whatever mechanism is underneath, and all the reference counting follows that. So you just install the one thing, it follows along naturally, and you get the file system access that you wanted for your customer. Yeah, so essentially the hugetlbfs concept, but somehow abstracted. Right. Yeah, I think we could do that, but again, I think we still need some opt-in from the process, because they need to understand the new mprotect semantics. It can be the, sorry, Jason, go ahead. I was going to say it would have to be like a new MAP option, like MAP_SHARED_PROTECTIONS. Goodness, do we have that many MAP flags left? But I think the opt-in and the API are not really necessarily tied to the necessity of having an additional mm_struct. So you can have the mshare API like Khalid suggested, and then create some different kind of object, not a full-blown mm, and just link it from the VMAs that are shared, to signify that there is a sharing mechanism going on there. That is potentially possible. Creating a separate mm_struct does simplify being able to use the existing mechanisms for managing all of the page tables. It's an abbreviated mm_struct that we are using just as a container, because we have a simple way to manipulate it using existing functions. I mean, we did add the MAP_SHARED_VALIDATE flag to allow for MAP_SYNC, so like we actually did get
that in. We have, maybe, I think we still have room there to have a MAP flag that will fail if it's not supported, and work otherwise. So if you want to do a new MAP flag, I don't think it's impossible. Okay, that's something to consider. Your thoughts, Matthew? You know, if people are just too scared of this feature then we can go back to doing something else. I just thought it was a really interesting idea to say we will take these two processes and make them partially threaded between them. I thought that was worth exploring, but if it turns out to be a solution in search of a user, then we can do something simpler. Okay, yeah, how about this: I'm getting ready to send out the next version of the patch in the next two weeks or so, so feel free to comment, give us more feedback, and we'll tear it apart from there. Jan, you wanted to say something? Yeah, so Matthew said the reason the application doesn't go fully threaded is the mmap_sem, but then we actually have very similar problems with contention on the mmap_sem of the shared mm, because where we have this mshare region, we have to have some synchronization again for changes, which are presumably going to happen under the mmap_sem of the shared mm_struct, and then we have basically the same mmap_sem contention issues there as we would have in the threaded case. Thank you, Jan, I knew somebody was going to bring that up, so thank you for doing it. To a certain extent yes, to a certain extent no. I mean, these processes don't just have the big mshare region. If all of the mprotect, mlock, munlock, whatever activity was happening on that region, then yes, you'd be right, but they also have the other stuff going on which is actually private to that process, and so you've effectively split the mmap_sem in half, and if it's like 80-20 in one direction or the other, then you're going to see some amount of improvement or dis-improvement. But the key is that they get to choose, they get to ask
for it, yeah. And so to that, actually, a second question: instead of providing this mshare region, wouldn't it be better to provide a way for a thread to set aside a part of the address space where, you know, it would be really thread-private, and then we could do simple locking there, without all the mmap_sem dances, essentially, because we know it's thread-local? Man, it's like talking to a version of myself from 12 months ago. Yeah, we absolutely considered that and we talked about it, but we ended up deciding that, well, if you let a thread have its own mm_struct, isn't that essentially the same thing? Are we just doing the same thing but in reverse? Well, I think conceptually it might be easier to grasp the situation that this range of the address space is thread-private to a particular thread and we don't do mmap_sem locking there, than it is to grasp the concept of a shared mm_struct. Yeah, I guess. I'm sorry, but I have to cut it here because we are overflowing into the next slot, and it seems like the more questions we ask, the more questions we get. So yeah, it's definitely interesting. But that's fine, we can always continue this discussion in the next slot, right? Right. Thank you all. Thank you.