Okay, I think that looks right. I'm T.J. Alenbaugh. My co-presenter today is Yuanchu, who should be on Zoom and may just be a disembodied voice through the speakers; we'll see if his audio is working in a minute. I'm really grateful to have a co-presenter, because what we want to show today is a single problem that occurs in two different contexts, and hopefully a path forward that lets us solve it in both. It's nice to be able to see the two contexts side by side. So, Yuanchu, are you there?

Hello. Oh, okay, great. Why don't you take it away, and I'll be the button pusher for the slides.

Yes, I'll be talking about memory overcommit in containerized environments with T.J. Next slide, please. Okay, there we go.

For both clients and servers, workloads can be containerized with VMs, Kubernetes containers, and memcgs, but the workloads differ between the two. Client jobs are more bursty and unpredictable, since they react to user interaction; those systems need to respond quickly to interesting events and be aware of energy usage. Server jobs have a more predictable memory footprint and are concerned with stability and performance. A common technique is proactive reclaim, which reclaims memory ahead of memory pressure and makes apparent the amount of actually free memory on the machine. Client use cases include a virtualized OS on desktops and tablets, or isolated execution environments for security. Servers are more concerned with SLOs for different availability tiers, proactive reclaim, as I said before, and demotion and promotion between memory tiers. Next slide, please.

To look at this problem of overcommitted memory, we want to introduce the concept of the working set as a histogram. The idea is a binning of pages by idle time, or coldness, where each bin contains a page count. In the diagram on the right, the bottom two arrows are labeled "idle age interval": all the pages in a particular bin are considered idle for some time between two adjacent interval values. The bins could come from the active/inactive LRU split, from MGLRU generations, from DAMON's access frequency, or even be estimated from user space. In the system we're presenting today, we collect the working set in the guest, or per memcg hierarchy, for a better estimate of memory utilization inside containers. Reports are generated on demand from reclaim activity or when queried by the controller, and we use the balloon device to send the working set to the host, which lets the host make balloon size decisions for each guest. Next slide, please.

Right. For containerized workloads, the data center and client use cases are quite similar. In both cases we have a controller listening for events, occasionally probing the containers and VMs, and implementing policy decisions based on the information it receives. On the left-hand side we show the data center use case. Say we have three jobs, and the one on the right is a best-effort job that may use some free capacity. If a latency-sensitive job goes into reclaim, the management daemon receives a working set report and can shrink the latency-tolerant job based on that report. We can also kill the best-effort job if its working set can no longer be kept resident. The core piece of the puzzle here is the working set report, which is what lets the controller set precise values for the limits.
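To picture what such a report contains, here is a minimal, purely illustrative Python sketch, not the actual report format from the patches or the spec: a histogram of idle-age bins with page counts, plus a helper that estimates how many bytes have been idle longer than a cutoff, the kind of number a controller might use when picking a new limit.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages

@dataclass
class Bin:
    idle_min_ms: int   # pages here have been idle at least this long...
    idle_max_ms: int   # ...but less than this
    pages: int         # page count for this bin

@dataclass
class WorkingSetReport:
    bins: list[Bin]    # ordered from hottest (youngest) to coldest (oldest)

    def cold_bytes(self, cutoff_ms: int) -> int:
        """Bytes idle for at least cutoff_ms (counting whole bins only)."""
        return sum(b.pages for b in self.bins
                   if b.idle_min_ms >= cutoff_ms) * PAGE_SIZE

# Example: three idle-age intervals for one job.
report = WorkingSetReport(bins=[
    Bin(0, 1_000, pages=50_000),            # hot: touched within the last second
    Bin(1_000, 10_000, pages=20_000),       # warm
    Bin(10_000, 2**63 - 1, pages=80_000),   # cold: idle for 10 s or more
])
print(report.cold_bytes(cutoff_ms=10_000) // (1 << 20), "MiB cold")
```

A controller that sees roughly 300 MiB idle past its cutoff, as in this example, can lower a job's limit by some fraction of that number instead of guessing.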
A very similar story plays out with overcommitted clients, where multiple VMs need to share the system's limited resources fairly. A fair-sharing policy can be based on the oldest pages in each VM's working set report, but T.J. will talk more about that.

Yeah, so just to highlight the right-hand side: the client-side use case has an additional complexity, which is that there's often a significant host application the user cares about, some native application running alongside the virtualized applications, and the user might be going back and forth between these contexts, maybe running an important application that's only available in a Windows environment and so runs virtualized. What is the same about the two contexts is that there's a limited memory resource, and the information about how that resource is being used is isolated by the constraints of the system. So we seek some way to get the global picture of memory utilization. On the right-hand side, we propose to do that by extending the balloon device, which I'll describe in more detail.

This slide shows the two components that get the data from the guest to the host. The first is a notification system of the kind Yuanchu mentioned, where we get notified from within the kernel that a new working set report is available. You can think of it as analogous to the shrinker interface: a component of the kernel, in our case the balloon driver, is a client of the interface. We show it as a pub/sub style interface where you subscribe to notifications, and in our implementation you're notified during background reclaim activity that a new report is available. The balloon driver receives that notification. The second component is an extension to the balloon driver so it can report this information back out to the device. After that, the VMM typically has its own mechanism that lets a host controller program retrieve the information. In QEMU this is QMP, the QEMU Machine Protocol, which communicates over a socket to a listening program; other VMMs have similar functionality, crosvm for example.

Then, a word on the host controller's responsibilities. Whether you're in the upper case or the lower case (sorry, I just lost control of the slides for a moment; there we go), these controllers receive the signals and give control inputs to the system. In the upper case those control inputs are setting memcg limits or driving reclaim for a particular memcg; in the bottom case, of course, the control input is changing the balloon size. And beyond the notifications, the controller also has the ability to query for this information on demand.
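As a concrete example of that query path, here is a hedged sketch of how a host controller might pull a report from QEMU over a QMP socket. The command name `x-query-balloon-working-set` is hypothetical, invented for illustration; the actual command is whatever the QEMU-side RFC defines. Only the `qmp_capabilities` negotiation is standard QMP.

```python
import json
import socket

def qmp_command(sock_path: str, command: str) -> dict:
    """Send one QMP command over a UNIX socket and return its result.
    Naive: assumes no async events are interleaved with the replies."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        f = s.makefile("rw")
        f.readline()  # QMP greeting banner
        # Capabilities negotiation is required before any other command.
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        f.flush()
        f.readline()
        f.write(json.dumps({"execute": command}) + "\n")
        f.flush()
        return json.loads(f.readline())

# Hypothetical command name -- the real one depends on the QEMU RFC.
report = qmp_command("/tmp/vm1-qmp.sock", "x-query-balloon-working-set")
print(report)
```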
With that information in hand, you can implement a policy for these adjustments. There's inevitably some notion of fairness, even if it's only implied, because the resource is limited. For the client-side use case we have to be very adaptable to what the user might be doing at any moment; for the data center use case we might be able to use historical data to inform the policy. Maybe now is a good time to see if there are any questions before I move on to the next part.

Just a question on the report part: virtio-balloon already has a mechanism to report statistics to the hypervisor. Couldn't you simply extend that, always include the working set, and have the hypervisor poll at a regular interval?

Yeah, I think it would be possible to extend the stats feature. Let me go back to that slide and talk about it a little. The main thing about the stats feature is that the first thing you do is set your polling interval, and then the system begins to report. The benefit of our approach, especially as you expand to multiple VMs, is a signal-based system driven by reclaim activity. You could imagine that as a kernel feature that drives your stats reporting, as opposed to polling, and maybe extend stats that way. There's another piece that could perhaps fit inside stats too: the thresholding and rate limiting we do, so that heavy reclaim doesn't blast huge numbers of messages through the system. That's known as the reporting threshold in our system. The other thing that matters, again in multi-VM systems, is this: if background reclaim just generated a new report and you get queried immediately afterwards, you want some notion of how much staleness is allowed in a report, so you're not generating additional work. Those two settings are part of how you configure the system we've built; you configure them on the host through sysfs, and they're part of the balloon extension we're proposing (there's an illustrative sketch of this logic below). So I think you could expand stats to include this, but the way we've framed it for now, as an RFC proposal, is as a separate thing, to see what the community thinks and proceed with the conversation that way.

I don't think it's crazy to have that. Of course, the other question is whether you could just have some QEMU guest agent in the guest push that on demand to your hypervisor. There are various ways to communicate it; it's really a question of what the right interface is.

In terms of a vsock or something like that? Yeah, for example.

Yes, that would be a possible solution. There seem to be benefits, in terms of being globally useful, to something that just works out of the box for everyone, so that's the direction we've gone so far. The main thing to say is that we have patches we'd like people to comment on; I'm actually not sure, since I haven't had a chance to ask Yuanchu, whether he's posted them yet. Just a word about that.

Yeah, there are two patches that show what we have so far: the kernel patch with the actual changes for the feature we call working set reporting, and a second patch with the driver changes themselves. We also have a QEMU implementation RFC that isn't out yet but will be sent in the next couple of days, and the crosvm balloon already implements a form of this, available right now. For people on the virtio mailing list, we've also described the spec addition, which is essentially this configuration flow: sending the configuration information, getting the reports back out, specifying the bins and bin sizes and changing those, and all the details necessary for that kind of thing.
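To make the reporting threshold and staleness behavior mentioned above concrete, here is a small illustrative sketch of the logic in Python. In the actual proposal this lives in the guest kernel and driver and is configured through sysfs; the names and structure here are purely for illustration.

```python
import time

class ReportGate:
    """Illustrative rate-limit + staleness logic for working set reports.

    refresh_threshold_ms: minimum gap between reclaim-driven reports, so
        a burst of reclaim doesn't flood the host with notifications.
    report_staleness_ms: a query reuses the last report if it is newer
        than this, so a query right after background reclaim does no
        extra work.
    """
    def __init__(self, refresh_threshold_ms: int, report_staleness_ms: int):
        self.refresh_threshold_ms = refresh_threshold_ms
        self.report_staleness_ms = report_staleness_ms
        self.last_report_ms = float("-inf")

    def _now_ms(self) -> float:
        return time.monotonic() * 1000

    def should_notify(self) -> bool:
        """On reclaim activity: send a new report only if enough time has
        passed since the previous one."""
        now = self._now_ms()
        if now - self.last_report_ms < self.refresh_threshold_ms:
            return False
        self.last_report_ms = now
        return True

    def needs_regeneration(self) -> bool:
        """On a host query: regenerate only if the cached report is older
        than the allowed staleness."""
        return self._now_ms() - self.last_report_ms > self.report_staleness_ms
```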
Then, as Yuanchu can speak to, you can have various kinds of balloon policies on top of this. We'll be releasing a basic script that shows how you can implement a basic policy with this unified working set information, and how you can drive an auto-balloon kind of feature through it, to make it very simple to control balloons in a multi-VM scenario (a simplified sketch of what such a policy can look like follows below). Yuanchu, did you have any additional comments?

I'm also doing some final touch-ups on the kernel patches, and we'll be posting those in the next day or two.

One final note about the different balloon features: free page reporting is another option you could imagine using to solve this problem. One thing we see, at least in the client-side application, and Yuanchu hasn't spoken in detail about this, is a lot of situations with significant page cache in the VM itself, so the number of free pages reported is actually not that large. Another solution people pursue down this path is sometimes called balloon pulsing: inflate the balloon to drive reclaim, then pulse it back down, so that hopefully the guest then reports a lot of free pages. For the data center case, I think people do get some wins with that. For our case, with a user present, we're very sensitive to user-perceived latency, so we prefer a signal-based system that can respond very quickly to what the user is doing; that's why we proceeded down the signal-based path. That's really everything I was planning to go through. Does anyone have additional questions?
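Here is that simplified, purely illustrative sketch of a fair-sharing balloon policy, not the actual script: it splits a host-wide shortfall across VMs in proportion to how many cold (oldest-bin) pages each VM's working set report shows.

```python
def balloon_targets(cold_pages: dict[str, int], need_pages: int) -> dict[str, int]:
    """Split a host-wide shortfall across VMs in proportion to cold pages.

    cold_pages: per-VM count of pages idle past the policy's cutoff,
        taken from each VM's working set report.
    need_pages: how many pages the host wants back in total.
    Returns per-VM balloon inflation amounts, in pages.
    """
    total_cold = sum(cold_pages.values())
    if total_cold == 0:
        return {vm: 0 for vm in cold_pages}
    return {
        # Never ask a VM for more than its own cold pages.
        vm: min(cold, need_pages * cold // total_cold)
        for vm, cold in cold_pages.items()
    }

# Example: the host wants 60k pages back; VM "b" holds most of the cold memory.
print(balloon_targets({"a": 20_000, "b": 80_000, "c": 0}, need_pages=60_000))
# -> {'a': 12000, 'b': 48000, 'c': 0}
```

The point of the design is that inflation lands where the report says memory is coldest, rather than being spread evenly across VMs.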
I can just add something. I'm not a friend of auto-ballooning, if you ask me; I maintain virtio-balloon, and I don't like balloon inflation and deflation for this. I can understand why people do it, because it's always been done like that. But what I think would be better in the long term is to use free page reporting to report free pages, plus, in addition, some way to tell your guest VM to limit the size of its working set. Meaning you would say: you're Linux, you have four gigabytes of memory, but please only try to use two gigabytes of it, something like that. And if the VM then runs into an out-of-memory situation, instead of just randomly crashing like we've seen with auto-ballooning, it could exceed that limit and report back that it desperately needs more memory. Because with auto-ballooning you can really run into situations where a workload in your VM suddenly consumes a lot of memory, whatever policies you have are too slow, your guest applications crash, and your customer is not happy, which is why auto-ballooning is usually not used anymore nowadays. Just to give you an idea of what could be done. And it would still fit into the whole picture of the working set size; you'd just use a different mechanism to put some memory pressure on the guest without causing too much harm.

Yeah, I recall your comment from the mailing list. One part of this, which you'll see in the spec change, is what we call a notification virtqueue, where so far there are really only two notifications. One is a configuration notification: the device has a message about how we'd like to configure working set reporting. The second is a request for a working set report. One of the things we'd speculated on is an additional notification of exactly this nature, saying we'd like you to stay within this allocation, something like that. It hasn't been part of what we've done yet, but I see your point for sure.

Hi, I'm just wondering how this could scale to a pooled-memory type of use case. Do you see the working set as something that could be passed externally, as a per-node rather than per-process or per-VM metric, provided up to an orchestrator that decides to migrate a VM or move a workload, something like that?

Well, one of the things the RFC Yuanchu focused on does is implement this hierarchically for cgroup v2 memcgs, so you have to find some way of hierarchically allocating these bins. If you have two child memcgs that each have a working set, the parent memcg has to aggregate them. I'd imagine you'd do a similar kind of operation to aggregate working sets at a higher level, especially when people have chosen different binnings and you have to figure out how to properly aggregate the bins upward (there's an illustrative sketch of one such scheme below). I haven't thought very much beyond that. Yuanchu, do you have a comment on that?

When you say nodes, are you referring to machines or NUMA nodes?

Machines. Or honestly, when it comes to CXL-based pooling, one root port. The CXL plane doesn't even have a good concept of which root ports are part of a multi-socket system or whatever. Basically, a layer above plain process management within one kernel instance.

Well, the working set would certainly tell you there's not much hot memory on this system, however you want to delimit it, a lot of hot memory over here, and not much over there, and you could presumably make some kind of decision based on that. And I remember from the previous talk the idea that there could be many VMs with a lot of hot memory: we sold everyone these VMs with this much memory, knowing they wouldn't all use it, but this particular set of users is using a lot of memory right now and could exceed our model of how we've set things up. Yes, right, and this is the kind of system that would allow you to detect that and then make some decision, so that you can maintain your SLA.
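On that aggregation question, here is a hedged sketch of one conservative scheme, my own illustration rather than anything the RFC mandates: each child's bins are folded onto the parent's interval edges, crediting a child bin to the youngest parent bin containing its lower edge, so pages are never reported as idler than we actually know them to be.

```python
import bisect

def aggregate(child_hists: list[list[tuple[int, int]]],
              parent_edges: list[int]) -> list[int]:
    """Fold child working set histograms onto a parent binning.

    child_hists: per-child lists of (idle_min_ms, page_count) bins.
    parent_edges: ascending lower edges of the parent's idle-age bins,
        e.g. [0, 1_000, 10_000].
    A child bin is credited to the youngest parent bin containing its
    lower edge, so pages are never claimed to be idler than observed.
    """
    totals = [0] * len(parent_edges)
    for hist in child_hists:
        for idle_min_ms, pages in hist:
            # Index of the parent bin whose interval contains idle_min_ms.
            i = bisect.bisect_right(parent_edges, idle_min_ms) - 1
            totals[max(i, 0)] += pages
    return totals

# Two children with different binnings, folded onto edges [0, 1s, 10s).
kids = [[(0, 10_000), (5_000, 4_000)],      # child A: edges at 0 and 5 s
        [(0, 7_000), (20_000, 9_000)]]      # child B: edges at 0 and 20 s
print(aggregate(kids, [0, 1_000, 10_000]))  # -> [17000, 4000, 9000]
```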
Yeah, and it's not unlike your comment about the server use case, where you need to juggle these VMs but the VMs might not be the most important thing on the machine; there are other things going on, so you need this broader sense of context. It's actually a thought that occurred to me during the discussion about handing this information over to user space. An application could say, in absolute terms, this is the kind of memory I want, but what else is going on might matter to the deployment in ways it doesn't matter to the application. It really needs to be expressed in relative terms, so that one single entity understands the actual priority list.

Right, yeah. Again, that's the driver here: trying to construct global knowledge when it has necessarily been isolated by the constraints of your system. That's probably the most succinct way of saying what we're trying to do. Thanks.

Okay, well, thanks very much, and thank you too, Yuanchu.