OK. Hello, everyone. I'm Paru, a PhD student at BU, and this is a collaborative project between Red Hat and BU. I'm going to be talking about how, in certain applications, main memory can become a bottleneck, and about a mechanism we're proposing to get around that.

Most modern architectures have multiple cores, so you can run a lot of workloads in parallel, and they share resources at different levels of the hierarchy. With that inherent sharing comes a lack of isolation. There are usually two types of isolation. One is spatial, or space-based, which has received a lot of attention; there are lots of mechanisms in place to ensure spatial isolation, so applications of different users are isolated from each other and from the OS, and even within the OS you have isolation via VMs. The other kind is temporal, or time-based. What I mean by that is: when two applications are running on the same system and one application affects the runtime of the other, there is a lack of temporal isolation. There are usually three main reasons for it: sharing at the cache level, the main memory bus being a bottleneck, and the memory controller, because the DRAM can only go so fast. Keeping these three things in mind, what actually affects your workload depends on what kind of application you're running. If you have something that is pure compute, with no interdependencies, that can be spread over multiple cores, then none of these things will really affect you. But most workloads that run in the cloud these days not only have compute, they also have huge memory requirements.

So what do we know about the current DDR structure? With current DDR4, the maximum bandwidth you can extract from the memory bus is around 20 gigabytes per second, and a single core on a socket can already extract up to 5 gigabytes per second. So if you have four cores running really memory-intensive tasks, you've already saturated the memory bus, and any additional task you add will create contention for this shared resource, and you will have a lack of temporal isolation. We know that theoretically this is going to be a problem, but we also wanted to see how bad it gets experimentally.

Before I go into the experiment, I want to talk about the synthetic benchmark we created, so that we had a complete understanding of what we were running. We started by creating a 100 MB buffer array and just looped over it 100 times, so in one run of the experiment we were accessing almost 10,000 megabytes. We wanted to create low cache locality. What I mean by low cache locality is that every access into this array should be an L3 miss and have to go to main memory. The way we did this was that, as we looped over the array, the next entry we accessed skipped 16 integers ahead. Why 16? Because that's exactly 64 bytes, which is one cache line, so an entry 16 integers away falls in the next cache line, causing an L3 miss and a trip to main memory. We ran this experiment and measured how many L3 misses we were actually getting, using PAPI.
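To make the setup concrete, here is a minimal sketch of that kind of loop, not the actual experiment code: it touches one int per 64-byte cache line over a 100 MB buffer and counts L3 misses with PAPI's low-level API, assuming the PAPI_L3_TCM preset is available on the machine. All names and constants are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define BUF_BYTES  (100u * 1024 * 1024)          /* 100 MB buffer                  */
#define N          (BUF_BYTES / sizeof(int))     /* number of ints in the buffer   */
#define STRIDE     16                            /* 16 ints = 64 B = 1 cache line  */
#define ITERS      100                           /* loop over the buffer 100 times */

int main(void)
{
    int *buf = malloc(BUF_BYTES);
    int evset = PAPI_NULL;
    long long misses = 0;
    volatile int sink = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_L3_TCM);          /* L3 total cache misses */

    for (size_t i = 0; i < N; i++)               /* touch once so pages are faulted in */
        buf[i] = (int)i;

    PAPI_start(evset);
    for (int it = 0; it < ITERS; it++)
        for (size_t i = 0; i < N; i += STRIDE)   /* one access per cache line */
            sink += buf[i];                      /* read variant; a store here gives
                                                    the write variant of the benchmark */
    PAPI_stop(evset, &misses);

    printf("L3 misses: %lld, expected about %llu\n",
           misses, (unsigned long long)ITERS * (N / STRIDE));
    free(buf);
    return (int)sink;
}
```

Built with something like `gcc -O2 bench.c -lpapi`; the expected count is simply one miss per strided access if the prefetchers are really out of the picture.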
What we saw was a 15 to 20% discrepancy: the theoretical value was actually 15 to 20% higher than what we were measuring, which was a little suspicious to us. At this point, we had already turned off all the hardware prefetchers. So we thought maybe there was some other prefetching happening that we hadn't been thinking about and hadn't been tracking. So we tried another approach: rather than skipping 16 integers, we skipped ahead by a really large prime number (a sketch of this traversal appears a little further down). From the processor's perspective this looks pseudo-random, so it cannot figure out the access pattern, and this should fool the prefetchers. But we still saw that PAPI was underreporting our L3 misses. Our hypothesis was that on Intel machines the L3 replacement policy is not completely random; it's some kind of pseudo-random replacement policy, so some parts of the L3 cache get used much more than other parts. The way we dealt with that was, rather than having a buffer of just 100 MB, we increased the buffer size to 2 gigabytes. What this does is that in one iteration you have much more data that has to pass through the L3, so even the parts of the L3 that were less likely to be evicted have more data trying to fit into them, and they do get replaced at some point. That did end up working: in the end we got much more accurate results, with about a one percent discrepancy, which is much better than what we were seeing before and supported our hypothesis.

So now we have our synthetic benchmark, and it has two options: one is only performing reads, the other is only performing writes. For all the experiments I talk about later we have results for both reads and writes, but for today's presentation I'm just presenting the write data.

So, having explained the synthetic benchmark that we completely understand: how bad is contention at main memory when you run a completely memory-intensive task? This is the result we got. On the x-axis is the number of other cores that are also performing writes; you can think of it as the number of other cores contending for main memory. If you look at the value 10, that means there are 10 other cores trying to access main memory at the same time as the core under analysis. On the y-axis is how much memory bandwidth the core under analysis is getting. Like I said earlier, one core, when nothing else is running on the system, can get around 5 gigabytes per second. But as more and more cores become active and start contending for main memory, you see a performance drop of almost 50%. If a task is really latency-sensitive, that's a huge drop. So we've shown that this is a problem, and that when you have multiple workloads running on a machine, there is very little intuition about how one workload in particular will be affected when other applications are also contending for the resources.
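Picking up the pseudo-random stride mentioned above, here is a hedged sketch of that traversal over the 2 GB buffer. The particular prime (2^31 - 1, which is co-prime with the power-of-two element count, so one pass still visits every element exactly once) and the idea of deriving bandwidth from bytes touched over elapsed time are illustrative assumptions; the actual experiments may have measured bandwidth differently.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES  (2ull * 1024 * 1024 * 1024)   /* 2 GB buffer                    */
#define N          (BUF_BYTES / sizeof(int))     /* element count (a power of two) */
#define PRIME      2147483647ull                 /* 2^31 - 1, co-prime with N      */

int main(void)
{
    int *buf = malloc(BUF_BYTES);
    struct timespec t0, t1;
    size_t idx = 0;
    volatile int sink = 0;

    for (size_t i = 0; i < N; i++)               /* fault all pages in up front */
        buf[i] = (int)i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) {
        idx = (idx + PRIME) % N;                 /* looks pseudo-random to the hardware */
        sink += buf[idx];                        /* roughly one new cache line per access */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mib  = (double)N * 64.0 / (1024.0 * 1024.0);   /* one 64 B line per access */
    printf("bandwidth: %.1f MiB/s\n", mib / secs);
    free(buf);
    return (int)sink;
}
```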
And there are a lot of real use cases for this. Imagine you're in a cloud computing environment and you have a premium user who has an SLA or cares about the 99th-percentile tail latency. If you're running some non-critical task on the same machine as this premium user, how do you make sure the premium user isn't suffering and you're actually meeting those agreements? Another is the HPC community: a typical HPC workload has multiple threads of execution, so how do you make sure these threads all complete comparatively close to each other, so that you get a better overall execution time? And the real-time community cares about this too, because deterministic performance is extremely important there: how do you make sure critical tasks actually make their deadlines when non-critical tasks are competing for the same resources? Currently, all these industries have their own ways of working around the problem. In the cloud you get a lot of underutilization of resources, because providers don't want contention to happen. In HPC, people make sure programmatically that everything gets done as it should. And in real time, people sacrifice performance and execution time by implementing software solutions to these problems.

Manufacturers have also realized that there is a need for innovation in this direction, and one such technology that has come out is Intel's Resource Director Technology. There are two things this technology provides. One is the profiling aspect: for a running application, a set of cores, or a virtual machine, you can see its last-level cache usage and how much memory bandwidth it's using. The second aspect is the management part, which lets you partition your cache and also divide your memory bandwidth into parts. The intuition we had was that it would behave like a traditional network bandwidth controller, where you reserve a quota of the total bandwidth and give it to a subset of the applications. That was our intuition when this came out, and then we ran experiments to see whether we were thinking about it the right way.

Before I get into those experiments, I want to briefly explain the architecture and the test setup; even the earlier experiment, where I showed that contention is a problem, used this machine and this setup. We're using the Intel Skylake architecture. It has two sockets, each socket has 20 cores, and each core has two hyperthreads. The last-level cache is around 27 megabytes. Because we wanted as little noise as possible from other things running on the system, we took certain precautions. First, we made sure the CPU only ran at a constant frequency, so that when we're not doing memory transactions but some CPU work, we always get a constant time rather than some variation. We made sure the CPU never went into an idle state. We turned off Turbo Boost. We turned off the hardware prefetchers. And we made sure all our experiments ran on only one of the sockets, with all the kernel work pushed onto the other socket, so nothing interrupts the experiments while they're running and skews the results.
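As one concrete example of those knobs: on Intel cores of this generation, the four hardware prefetchers are commonly controlled through bits 0-3 of MSR 0x1A4 (MISC_FEATURE_CONTROL). The sketch below shows one way to toggle them from C through the msr kernel module; this is an assumption about the mechanism, and the actual setup may just as well have used the wrmsr utility or BIOS options.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

#define MSR_MISC_FEATURE_CONTROL 0x1a4   /* prefetcher control on these cores */

static int disable_prefetchers(int cpu)
{
    char path[64];
    uint64_t val = 0;

    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);          /* requires root and the msr module */
    if (fd < 0)
        return -1;
    if (pread(fd, &val, sizeof(val), MSR_MISC_FEATURE_CONTROL) != sizeof(val)) {
        close(fd);
        return -1;
    }
    val |= 0xf;   /* bits 0-3: L2 HW prefetcher, L2 adjacent-line,
                     DCU streamer, DCU IP prefetcher -> set bit = disabled */
    if (pwrite(fd, &val, sizeof(val), MSR_MISC_FEATURE_CONTROL) != sizeof(val)) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(void)
{
    for (int cpu = 0; cpu < 20; cpu++)   /* the 20 cores of the socket under test */
        if (disable_prefetchers(cpu) != 0)
            fprintf(stderr, "cpu %d: could not program the MSR\n", cpu);
    return 0;
}
```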
Lastly, about the Resource Director Technology, RDT, itself: there are two ways to control it. The one we're using is an Intel library on GitHub that lets you program the relevant MSR registers and set all of this up. The other is via the kernel, which exposes a filesystem interface (resctrl) that also lets you manipulate it. We have currently turned that off, because we didn't want two different paths controlling the same registers without being able to tell whether one of them was changing something behind the other's back. And even though RDT has multiple features, for the rest of the presentation I'll only be focusing on one of them, which is Memory Bandwidth Allocation, MBA.

There are eight classes of service, and the throttling values you can set are 10, 20, 30, 40, 50, 60, then it skips over 70 and 80, then 90 and 100. What you do is take one of the classes of service, assign it a percentage, and then that class of service is assigned to a core or a set of cores. Based on which class of service a core gets, a delay value is applied. All of this is very abstract: we don't know what these percentages mean, what they are a percentage of, or where the delay value is being added. So the rest of the experiments were about trying to figure out what these numbers actually mean.

On the x-axis we still have the number of cores causing contention, and on the y-axis the bandwidth of the core under analysis. The core under analysis always gets 100% of the bandwidth. The other cores causing contention are all put on the same class of service, and we try all the different settings for that class to see whether any of them gives any protection to the one core we're giving 100% to. So when we set 40%, we're asking: if all the other cores are throttled to 40%, does our core at 100% actually gain any performance isolation? As you can see, the 10% case is the one that stands out. If you look at this line, it runs right here under the blue line: when there are 12 cores causing contention you're down here, but when all the other cores are throttled at 10%, you end up with much higher bandwidth, about a 45% increase for the core under analysis. But that's not the case for most of the settings: the lines for 90, 60, 50, 30, and 20 are almost at the same level as if no throttling were being done at all. So this does not seem to provide any protection to the core we are trying to isolate.
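To make the class-of-service mechanics concrete, here's a rough sketch of what this setup looks like through the kernel resctrl interface mentioned above. The experiments in the talk actually used the MSR-based Intel library with resctrl disabled, so this is only an illustration, and the group name and core list are made up.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Assumes the resctrl filesystem is mounted:
 *   mount -t resctrl resctrl /sys/fs/resctrl */

static int write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%s\n", s);
    return fclose(f);
}

int main(void)
{
    /* one directory per class of service */
    mkdir("/sys/fs/resctrl/contenders", 0755);

    /* MB: throttling percentage per socket (sockets 0 and 1 here) */
    write_str("/sys/fs/resctrl/contenders/schemata", "MB:0=10;1=10");

    /* cores 1-11 generate the contention and get the 10% class; core 0 is the
     * core under analysis and stays in the root group, which defaults to 100% */
    write_str("/sys/fs/resctrl/contenders/cpus_list", "1-11");
    return 0;
}
```

Writing the schemata sets the throttle percentage per socket for that group, and moving cores into its cpus_list associates them with that class of service.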
Another thing we wanted to see is: when we put multiple cores on one class of service, how do those cores actually get their bandwidth? We had two hypotheses. One was that if you have multiple cores on one class of service, they're all subject to a single quota, so no matter how many cores there are, the total they get should be about Q. The other hypothesis was that each core gets its own Q of quota, so with more cores you actually extract much more bandwidth in total.

If it was the first case, where multiple cores on the class of service are all subject to one Q, then as more cores join the class of service, the bandwidth of each individual core in it should decrease, because they're all splitting the Q amount of quota. But if it was the second scenario, where each core has its own value assigned to it, then no matter how many cores come into that class of service you would still see that value, and the line should be flat. We ran this experiment at 10% because we wanted to make sure the system was not saturated; if we had saturated the system, we would have seen a decrease in bandwidth anyway. So we ran at 10% and kept adding cores, and as you can see, we get a flat line. That means each individual core has its own quota assigned to it. Even if you're throttling everybody at 10%, if you have 10 different cores all in that same class of service, they're extracting 10 times the bandwidth. So the core you're trying to isolate and protect will still see much more contention than if all the cores shared one amount Q of bandwidth. In my opinion it would have made more sense for all the cores on the same class of service to share the total. Then you could say: if your system has around 5,000 megabytes of bandwidth and you throttle everybody else to 10%, everybody in that class has to share around 1,000 megabytes, and the leftover gives you that much more protection, because no one else can go over that bound.

So why does each core have its own quota? It's because of the way it's implemented in the hardware. The throttling lives in the core part of the chip, and that's where the delay is added: between L2 and L3. So the delay is applied before a request even leaves the L3 to go to main memory. This is done because it's much easier to keep track of all the threads while they're still in the core part, where you have the core IDs and everything attached to them; once a request leaves the core part for the uncore part, it's much harder to keep track of all that extra state. OK, so the throttling is between L2 and L3; that means that even if something is already in the last-level cache, a delay will be added to it. And that's the next thing we wanted to look at. But before we did that, we also wanted to check whether it even makes sense to add delays to something that is already in the last-level cache. Because, at least for me, the assumption was that if something is in the last-level cache you're already getting great performance, so why would you want to delay it?

So what we saw: on the x-axis we still have the number of cores causing contention, and on the y-axis the bandwidth of the core under analysis. Oh, I'm sorry, I forgot to mention something: the buffer size has been decreased by a lot. Before, our buffer was two gigabytes; now the buffer is only four megabytes. So even with six cores it should easily fit into the last-level cache. Yeah?
Even if it fits by size, with the cache's associativity you can't guarantee all of it actually stays resident, right? Yeah, yeah, you can't guarantee that every small chunk stays in the way you'd like. But we're looping over it, and since placement is based on the address, the probability is high that it fits: six cores times four megabytes is 24 megabytes, and we have about three megabytes of L3 left over, a way here and a way there, so it fits. It may add some noise at the end of the graph, but in general it fits. Also, we loop over it multiple times. There's also the replacement policy: it's not true LRU, it's one of those pseudo-LRU schemes, PLRU, which adds some noise by itself. One thing Paru didn't mention is that for this machine we turned off memory interleaving, so the mapping of addresses to the DRAM modules, and therefore the hashing policy, is much simplified. Do you want to add anything to that, Tom? Yes, the experiment also runs multiple times, so even if you occasionally hit an unlucky conflict where something falls out of the cache, we took an average in the end and the same trend comes out again.

So even with just two other cores you see your performance going down, and by the fourth active core your bandwidth is almost at the same level as if you were going to main memory. It seems like even though your data is in L3, your performance behaves as if you were going to main memory, which was pretty shocking, at least to me. So we saw that contention is happening at the L3 level, and we asked: OK, if MBA can't provide isolation at main memory, can it provide isolation at the last-level cache? We repeated the same experiment, where the core under analysis is always given 100% and all the contending cores go through all the different settings, but the only difference is that rather than the buffer being 2 gigabytes, it's now 4 MB. So at least for some number of cores, it should fit into the last-level cache. What we see is that this line that falls right here is the base case where we're not applying any control: no one is being throttled, and the default setting is that everybody gets 100%. But when you throttle all the other contending cores at 10%, you can scale up to almost 16 cores without any degradation in your performance; at 16 cores it's as if you were hitting the L3 every time. And when you throttle everyone else at 20%, you can still scale up to almost 14 cores, and then 12.
And you see that even though these percentages don't map to anything exact, and they're not equally spaced the way you'd expect them to be, they do give you some kind of performance gain. Yes? Question: in that graph, is there still one thread that has 100%? Yes, yes. And this is the number of other threads that are throttled at the given percentage? Yes. And the graph is for the one that's getting full bandwidth? Yes. You're graphing the one that gets full bandwidth, in the presence of a specified number of other threads that are being throttled? Yes, yes. Thank you.

So this was good news for us, because when we did it for main memory bandwidth it seemed like the hardware wasn't doing anything. The hypothesis we took away was that maybe the delay value is so small at all these throttling levels that it only stalls accesses for a very short time, so they still hit main memory and still cause contention, because there are so many other accesses already waiting there. If the delay value could somehow be increased, and the accesses made sparser at the main memory level, we might see the same benefits at main memory that we see here.

So this is where we are right now, and I'll talk about a couple of next steps. One thing we want to figure out is that clearly there is some benefit to using MBA when the data size was 4 MB, but no benefit when it was around 2 gigabytes. So where is the sweet spot where MBA actually works, and where does it stop working? Clearly at some level the delay values stop making sense, and at some point they do, so we need to run more experiments with different data sets and data sizes and see where it helps and where it doesn't. The other thing we need to do is that all our experiments so far have been on code that only does memory transactions, one read after another or one write after another write, with no compute phase at all. So we also need to simulate a more realistic workload that has both memory transactions and compute: do some memory transactions, then a busy loop, then another batch of memory transactions, and see how MBA reacts to that.
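A minimal sketch of what that mixed microbenchmark could look like, just alternating a strided pass over the buffer with a pure-compute busy loop; the phase lengths and constants here are made up for illustration and are not from the talk.

```c
#include <stdlib.h>

#define BUF_BYTES   (100u * 1024 * 1024)         /* memory-phase working set     */
#define N           (BUF_BYTES / sizeof(int))
#define STRIDE      16                           /* one access per 64-byte line  */
#define PHASES      100
#define COMPUTE_OPS 10000000u                    /* length of the busy loop       */

int main(void)
{
    int *buf = malloc(BUF_BYTES);
    volatile int sink = 0;
    volatile unsigned spin = 0;

    for (size_t i = 0; i < N; i++)               /* fault pages in up front */
        buf[i] = (int)i;

    for (int p = 0; p < PHASES; p++) {
        for (size_t i = 0; i < N; i += STRIDE)       /* memory phase */
            sink += buf[i];
        for (unsigned k = 0; k < COMPUTE_OPS; k++)   /* compute phase: no memory traffic */
            spin += k;
    }
    free(buf);
    return (int)(sink + spin);
}
```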
Before I end the presentation, I just want to give a very brief picture of the bigger goal. Imagine we can profile an application for how much memory it uses over a given period of time. On the x-axis I have time, the execution period of the application, and on the y-axis the memory it's trying to access. Looking at that profile, I can see that initially it accesses something that sits in the L3 with some bandwidth, then it accesses something in main memory, and finally it has something it can serve from L1 and L2, so that part is fast, its bandwidth requirement is high, and it finishes in some amount of time X. If I have these bandwidth controls, then for the phase that hits L3, where I know MBA works, I can throttle it by some amount using MBA. For the phase that goes into main memory, if MBA doesn't work there, I can use some software-based mechanism such as MemGuard to throttle it at a different level. And when it's working out of L1 and L2 it doesn't need any throttling, because those are still private, so it just runs, and I would be able to predict the new execution time. So the hope is that we can build an end-to-end system where we can profile a set of applications and then co-locate them with latency-sensitive applications on the same machine, not only to get guarantees about temporal isolation, but also to make sure the machines get high utilization. Thank you.

Just a curiosity: do you see a substantial difference between the reads and the writes? Yeah, it's almost a factor of two, because each write is effectively two accesses to main memory. Remember that contention graph with the 50% drop: when you're just doing reads, you only get about a 25% drop. But we don't know yet how the quotas actually treat writes, whether they're counted once or twice. And the second question I had: this is an earlier version of the implementation, right? Do we have any data on the later version? Well, Joe has to provide us the machine first. No, there's no data at all yet; I haven't even looked at the new machine. OK, I was just curious about that. Thank you.

A couple of basic questions. How many memory modules do you have in this machine? There are eight per socket, I think, two sockets, so around 256 gigabytes in total. So you used all the channels? Ideally, yes. Whether the mapping of memory really used all the memory channels, that's a separate question to check; we haven't verified that explicitly, so we can't say for sure that all the channels were used. And we made sure it was local memory. Yeah, local memory, so that's good.

I'll be quick. What's your intuition: how workload-dependent do you think these tunings are going to be? I think they're going to be very workload-dependent, because we recently read about how they derived the delay values: they did it based on one application. You have no idea what those delay values are or what that application was. Maybe for the application they used, the 10% and 20% settings really did mean the observed bandwidth dropped to 10% and 20%, and that's why they're named that. But we've already seen that when you change the data size, and I know that was a drastic change in size, the behavior changes. So I think the workload would change things quite a lot. And it's not only the delay between accesses, it's also the mix: at any point in time, how much memory traffic you generate in which time periods, and so on.
So they really had some specific application in mind. And part of what Paru just mentioned on the last slide is what she's going to look at next: having both reads and writes, simulating the ratio between the memory accesses and the compute work in between, and then, when there's time, seeing whether we can actually predict and approach those execution times.

Right, so they don't have an anti-starvation mode; they really only have a rate-limiting allocation mode. The thing is, for anyone who doesn't know, this is a poor man's implementation. They wanted to get the bullet point on the slide for marketing. Any real implementation of this would have been at the memory controller, in the uncore, but that's complicated because you need global state for it. So they went with the simple thing: they put this class-of-service throttling in the core, and therefore it affects the L3 accesses as well. So they're right in saying that they can limit bandwidth utilization, but it isn't really main memory bandwidth that's being regulated. On the other hand, one interesting thing that came out of this, a bit of a shock really, is that I typically don't think of the L3 as a bottleneck. And yet for some workloads, and perhaps we're being selective about which workloads we deploy, the L3 actually is the bottleneck. Imagine you go and write your application to be as L3-friendly as possible, because you want data locality and you try not to go to memory too often, since that would be a performance hit, and you make it fit in L3. Then even if you have a partition allocated with CAT, the bandwidth to L3 is still going to be a performance hit. And how much of a hit? Well, as bad as going to main memory. So you might as well not optimize at all and just go to main memory, if that's the worst case. Now, this poor man's implementation that Intel did actually gives us some knobs to control a kind of contention we were not aware of. And maybe Intel was not aware of it either, or maybe they were; maybe that's exactly what they were observing when they tested their test-case application and said, yeah, that's it.

But I can say this: Paru mentioned at the beginning that, especially in the field of real-time programming, people are already sacrificing absolute performance for predictability. So you could write your program so that it only expects, say, 50% of the total performance, and implement the restrictions using the available cache allocation and memory bandwidth allocation. The thing we might want to get to, if she manages to push this to that point, is to say that we can actually guarantee something that, without these mechanisms, nobody has been able to guarantee; it was simply off limits, not possible. If we limit the workloads in some way, perhaps we can say we have a tail latency of such-and-such, which we actually can guarantee. That would be a positive outcome. Overall, Paru always felt bad that she didn't have the big result, being able to say, I provide you with a library that does everything for you. So we keep telling her, and please tell her as well, that a negative result is also a good result, and it's worth publishing.
A question back to you again: in what scenario was the L3 cache as slow as going to main memory? Can you describe that scenario again? It's essentially the case where the working sets of the applications co-running on different cores still sum to less than the size of the L3, and yet, because they're all contending for L3 bandwidth, they take a performance hit. They're all contending for L3, not for main memory. So it's not the bandwidth out to main memory; all the contention is on L3 bandwidth. Yeah, that's correct. The control is between L2 and L3, and on Skylake the L3 is a victim cache, so whatever you evict from L2 goes into L3, but you're competing with everyone else there. Right, that's what the problem is. Correct, and that's why writes create more problems, because you read the line in and then write it back. But you can combine this with cache allocation; in that case you could guarantee the data is in the cache, but you still would not get the bandwidth.

So the next experiment is to actually split the cache. That's one of the experiments I still need to do, and these people have been telling me to do it for a while: when I run the 4 MB buffer case, there are 11 ways on this machine, so I give one way to each of these six cores and the remaining ways to all the other cores that I'm not even monitoring or caring about, so that none of the cores I do care about can take more than their own part of the L3. Then I run the same experiment when they each have their fixed section of the L3 and I know for a fact that their data is in their way. And I would probably still see the same amount of contention and the same drop.

If you read the Intel architecture documents, in the errata sections they do describe that on Skylake the cache allocation technology, the cache monitoring, and the memory bandwidth monitoring and allocation don't work quite as described: the cache allocation will not strictly restrict to the cores, it can leak, and the bandwidth numbers may not be real. They're working toward fixing that in Cascade Lake and beyond. Have you seen that cause irregularities in your numbers? We haven't worked with the cache allocation yet, so I believe you that it can spill over. Where we did see a lot of irregularities was when we were trying to use the memory bandwidth monitoring. There are two counters, a local one and a remote one, and the local and the remote did not always add up to the total correctly. Also, when we just looked at the total, we would only see around 90% of our data in there: when we were moving that 10,000 megabytes, we would only get around 9,000 to 9,800 reported. So we had some discrepancy in the total amount of data. So the monitoring was missing some? Sorry, to be clear, we had misses all the time; the working set was always missing from the cache, and that's exactly what we wanted to capture, because we did that whole pseudo-random access over two gigabytes to make sure. A small amount might stay resident, but in this case it's highly, highly unlikely: with a two-gigabyte buffer, if you use a stride that's pairwise prime with the array size, you walk right through the whole array and make your own pattern, so you have so many cache lines to access that it's highly unlikely any particular one of them stays resident. And she's disabling all the prefetching and all that kind of thing, so it's unlikely this has anything to do with that. And the numbers don't point that way either: it's one megabyte of L2 cache, so that size is not large enough to make such a difference.
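As a side note on that planned way-splitting experiment: here is a hedged sketch of what it might look like through the resctrl L3 schemata, assuming 11 ways per L3 and that a single-way bitmask is permitted on this part (the minimum mask width varies by CPU). The talk's tooling was the MSR-based library, so this is only an illustration with made-up group names.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Assumes resctrl is mounted at /sys/fs/resctrl, an 11-way L3 (full mask 0x7ff),
 * and that min_cbm_bits allows a single-way capacity bitmask. */

static void write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "%s\n", s); fclose(f); }
}

int main(void)
{
    char dir[64], path[96], line[64];

    for (int core = 0; core < 6; core++) {
        snprintf(dir, sizeof(dir), "/sys/fs/resctrl/way%d", core);
        mkdir(dir, 0755);

        /* one private way on socket 0; the other socket keeps the full mask */
        snprintf(path, sizeof(path), "%s/schemata", dir);
        snprintf(line, sizeof(line), "L3:0=%x;1=7ff", 1 << core);
        write_str(path, line);

        snprintf(path, sizeof(path), "%s/cpus_list", dir);
        snprintf(line, sizeof(line), "%d", core);
        write_str(path, line);
    }

    /* everything else (the root group) keeps the remaining ways 6-10 on socket 0 */
    write_str("/sys/fs/resctrl/schemata", "L3:0=7c0;1=7ff");
    return 0;
}
```

Each monitored core gets its own group pinned to one L3 way, and the root group keeps the leftover five ways for everything that isn't being measured.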
OK, I have a couple of comments rather than questions. First of all, since this is about memory bandwidth allocation and control, do you think it's interesting to have a buffer size that fits in the L3 at all? Wouldn't you want to miss in the L3 as much as possible? That was our initial intuition, and that's why we started with the huge buffer. But once we saw that the mechanism sits between L2 and L3, it means the delay is being added even when you hit in L3. I think the underlying assumption is that if you miss in L2 you're probably going to miss in L3 as well, plus the fact that it's easier to implement this in the core part than in the uncore part. Go back to the previous slide; even before seeing this, that was the giveaway, go back to the beginning, I think.

OK, can I interrupt for just one second? Right after this talk, if you go to the registration desk, we're giving out the party tickets. So if you want to come to the party, it's Friday night, seven p.m., and it will be here. But continue, I just want to make sure we get to the next session on time.

The next thing I wanted to ask is: have you thought about the NUMA aspects of the system and how they affect this? It's all local. OK. And then, one of the first slides talked about shared memory. You're not doing shared memory, not System V shared memory; these are individual processes doing malloc or mmap to get the memory? It's not System V shared memory, and I'm also using numactl to make sure I'm only accessing the main memory of that particular socket. Right, but it depends on how it's allocated; the memory policy should handle that if it's coming from malloc or sbrk or anonymous mmap. It's mmap, yeah. OK, all right. And the other question I had was: have you looked at the three different page sizes on Intel and whether they impact this? We have experiments for both huge pages and normal pages. What about gigabyte pages? Yeah, that's the huge pages. How did you get those? They won't come from plain malloc or mmap. It is mmap; there's a flag that requests them, and then you can also double-check. You checked that? Yeah, I double-checked it in my code. I was just curious about it. Of course; the very first part of this work was controlling every variable that can affect the experiment. And you had PAPI measure the TLB misses for huge pages versus non-huge pages. Thanks to everyone again.
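On that last huge-page exchange: the mmap flag being referred to is presumably MAP_HUGETLB. A minimal sketch, assuming huge pages have been reserved beforehand (for example via vm.nr_hugepages), with PAPI's PAPI_TLB_DM preset as one way to compare TLB misses between the two page sizes.

```c
#include <stdio.h>
#include <sys/mman.h>

#define BUF_BYTES (100u * 1024 * 1024)

int main(void)
{
    /* ask for the buffer from the 2 MB huge-page pool */
    void *buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");             /* e.g. no huge pages reserved */
        buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* normal-page fallback */
    }

    /* ... run the benchmark over buf and compare a data-TLB miss counter such as
     * PAPI_TLB_DM between the huge-page and normal-page runs ... */

    munmap(buf, BUF_BYTES);
    return 0;
}
```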