Hello, I'm a part-time research intern at Red Hat and I'm pursuing my PhD at Boston University. I'll be talking about how to provide isolation guarantees while co-running applications that saturate the main memory subsystem.

When many workloads run in parallel on modern architectures, you have no control over what is happening in the memory hierarchy, and furthermore no intuition about what is happening to your program's runtime. For some applications it is important to provide guarantees, some sort of quality of service, and contention causes a lot of problems here.

Here are a couple of use cases where solutions that provide isolation and determinism help. Let's take a look at the cloud computing world. Suppose there's a customer that has an SLA or cares about 99th-percentile tail latency. The premium customer not only cares about fast execution but also about deterministic timing, and there are non-critical tasks running on the same cloud. How do we solve this problem? Depending on how many cores are active and what workloads are running, the premium user will see different changes in performance. Similarly, in the real-time community, missing a deadline can be a catastrophic event, and deterministic performance is extremely important. Think about self-driving cars: a simple microcontroller system is not possible anymore. You need a multi-core system, but at the same time, how do you guarantee that critical tasks will meet their deadlines?

So sharing of resources is a problem that many workloads encounter, and one of the biggest bottlenecks in this area is main memory. Most current solutions underutilize resources so that workloads don't cause any interference, essentially by not sharing resources at all. A lot of thought has been given to this problem, and a couple of vendors have come up with their own solutions; let me talk about two of them, and then I'll focus on one.

Intel's solution to main memory being the bottleneck is called Memory Bandwidth Allocation (MBA). This version of MBA uses a programmable rate controller between the L2 cache and the rest of the processor: based on the MBA setting, a delay value is added to your requests. The controller sits at the L2 boundary because of how the processor is organized. There are CPU cores, memory controllers, PCIe agents, and so on, all connected in a mesh architecture; hence, except at the memory controllers themselves, there is no single point through which all memory traffic flows. Furthermore, if memory bandwidth as a whole is supposed to be controlled, not even the memory controllers at a global level have this information.
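To give a feel for how MBA is driven in practice, here is a minimal sketch using the Linux resctrl interface. It assumes a kernel with resctrl support on an MBA-capable Intel CPU; the group name, the 50% cap, and the PID are purely illustrative (and note this is not the mechanism used later in this talk, which targets an ARM platform).

```python
# Minimal sketch: throttling a process with Intel MBA through Linux resctrl.
# Assumes resctrl is mounted: mount -t resctrl resctrl /sys/fs/resctrl
import os

RESCTRL = "/sys/fs/resctrl"
GROUP = os.path.join(RESCTRL, "throttled")      # illustrative group name

os.makedirs(GROUP, exist_ok=True)               # creating a dir creates a group

# Cap memory bandwidth on socket 0 to ~50% of the unthrottled rate.
# MBA percentages are coarse hints, not exact caps: the hardware inserts
# delays between the L2 and the mesh, as described above.
with open(os.path.join(GROUP, "schemata"), "w") as f:
    f.write("MB:0=50\n")

# Move a victim task into the group so the throttling applies to it.
with open(os.path.join(GROUP, "tasks"), "w") as f:
    f.write("12345")                            # hypothetical PID
```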
The ARM solution, on the other hand, lets you control traffic from peripheral devices such as GPUs, which make intense memory transactions, whereas Intel lets you throttle per core. So there doesn't seem to be one mechanism that does both, at least for the time being. But I'm sure things are progressing fast in this domain, because everybody realizes there is a need for this.

For traffic within the CPUs we use a software-based technique called MemGuard (I'll show a toy sketch of the scheme at the end of this part). It has a global regulation period, and at the end of each period the budget is replenished by an interrupt on each core simultaneously. Each core has its own value for the budget, and a core that exhausts its budget is stopped for the time remaining in that period. For all our experiments we statically set the global period to one millisecond.

This problem of main memory being a bottleneck has been around for a while, and a lot of communities have given it thought and tried to come up with solutions. For the rest of the presentation I will focus on an embedded system where we try to provide determinism using the ARM QoS mechanism and MemGuard, with a focus mostly on the real-time community.

Let me briefly describe the model of the application running in this work. We consider a model where an application runs not only on the CPU but on other processing elements as well, such as the accelerator in this figure. You can see that our application goes through different stages: first it does some work on the CPU, then in the green segment it switches to the accelerator, and in the last segment it goes back to the CPU. On the lower half of the diagram you can see M1 and M2: M1 tracks the memory transactions made by the application only on the CPU, and M2 tracks the transactions made only on the accelerator. In the first segment, when the application is running purely on the CPU, the cumulative memory transactions increase in the middle graph but not in the lowest graph; when the application switches in the green segment, the middle graph stays flat and the bottom graph suddenly starts seeing memory transactions. We used a profiler to gather all these results.

Profiling was a key part of this project, so let me define the goals we set for the profiler when we started working on it. First, the profiler gathers all the variations across runs of the application and builds memory envelopes that encapsulate the worst-case execution time. That is very important for the real-time community: you always want to make sure you capture the worst case and never make unsafe predictions, because a wrong prediction can be fatal there. That is why these graphs have an upper envelope and a lower envelope that capture all the variations, and the profiler does this not only for the CPU but for all memory transactions made at the DDR controller, which can include all the peripheral devices as well. Second, the profiler helps with consolidating multiple workloads on a system while still providing isolation guarantees: it helps you figure out the settings for both the ARM QoS and MemGuard, see how the utilization of the DDR controller changes, and keep the entire system below the sustainable utilization limit.
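Here is the promised toy illustration of the MemGuard scheme. The real MemGuard is a kernel module driven by performance-counter overflow interrupts; this sketch only mimics the per-period budget accounting, and all the numbers are made up.

```python
# Toy, user-space illustration of MemGuard-style regulation (not the real
# kernel module): each core gets a per-period budget of cache-line refills;
# a core that exhausts its budget stalls until the next period's interrupt
# replenishes it.
PERIOD_MS = 1                          # the global regulation period we used

def regulate_one_period(demands, budgets):
    """demands[i]: refills core i wants this period; budgets[i]: its cap.
    Returns (served, deferred) lists for one period."""
    served, deferred = [], []
    for want, budget in zip(demands, budgets):
        got = min(want, budget)        # a core that hits its budget is
        served.append(got)             # stalled for the rest of the period;
        deferred.append(want - got)    # leftover demand carries over
    return served, deferred

# Four cores, equal budgets; core 0 behaves like a memory bomb.
served, deferred = regulate_one_period(demands=[9000, 800, 600, 200],
                                       budgets=[1250, 1250, 1250, 1250])
print(served, deferred)                # -> [1250, 800, 600, 200] [7750, 0, 0, 0]
```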
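The envelope construction itself can be sketched in a few lines. This is a simplified version of the idea, assuming the raw traces from each run have already been resampled onto a common time grid; the function name and the example data are illustrative.

```python
# Sketch: turning the profiler's raw traces into upper/lower envelopes.
# The pointwise max/min across runs bound every observed behavior, which
# is what the worst-case analysis needs.
import numpy as np

def build_envelopes(runs):
    """runs: shape (n_runs, n_samples), cumulative DRAM transactions per
    run, all resampled onto the same time grid."""
    runs = np.asarray(runs)
    upper = runs.max(axis=0)           # worst case at every instant
    lower = runs.min(axis=0)           # best case at every instant
    return upper, lower

# Three made-up runs of the same task, 5 samples each.
runs = [[0, 100, 400, 420, 900],
        [0, 120, 380, 430, 880],
        [0,  90, 410, 410, 910]]
upper, lower = build_envelopes(runs)
print(upper, lower)                    # -> [0 120 410 430 910] [0 90 380 410 880]
```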
Before I move on, I want to say a few more words about profiling, since these ideas can be extended to other platforms and setups as well; in future work we hope to bring some of this to a cloud environment too. To precisely measure these quantities, the profiler must be designed with two specific attributes. First is a transparency requirement: the task under observation should not be impacted, or only very minimally, by the activity of the profiler. On the other hand, the finer the granularity of the profiler, the smaller the pessimism of the worst-case execution estimates; in a cloud environment, finer granularity means better predictions about SLAs and so on. The one problem is that these two objectives, fine-grained and transparent, compete with each other, so you need to find a good balance to achieve both.

Let's take a look at the layout of our embedded system. To ensure our profiler is as transparent as possible, we give it access to its own DDR controller: in this figure, CPU4 acts as the profiler and uses its own DDR controller 1. It keeps track of all the transactions made to DDR controller 0 but stores its samples through DDR controller 1, ensuring that the application under analysis is not impacted in any way. Second, you can see that the QoS mechanism can differentiate traffic between the CPUs, the GPUs, and the APEX accelerators (cores that are more performance-oriented), but within the CPUs there needs to be some additional form of management; for our particular case that is the software-based technique I mentioned a couple of slides earlier, MemGuard.

Before I show the final results, I want to put all the moving parts together and spell out the requirements for running a full-system experiment while still showing deterministic results. First, we run the application multiple times in isolation, without memory bandwidth regulation, and capture its worst-case execution time and DRAM usage. The profiler gives you just raw numbers, so our algorithm then takes the profiles and converts them into upper and lower memory transaction curves, the task model I described earlier. Once this is done, we can predict how the application's runtime will change under different settings and throttling levels of both MemGuard and QoS. That was key for us, because in the real-time community, predicting how an application will behave when other things enter the system is very important.

There were multiple overheads we had to take into account. I'll mention one: when we set MemGuard to a particular budget for each core, it uses some of the assigned budget for its own bookkeeping, keeping track of how many cache-line refills have already happened and how many are left in that period, and reserving some for sending the interrupt, and so on. We had to make sure the actual budget we were assigning to the application accounted for the fact that MemGuard consumed part of it; otherwise we would have claimed that MemGuard was set to x and the application was using x, when in reality the application was getting something less than x. There were other such details that we had to take into account in our algorithm; for more details, see the paper we just published.
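This bookkeeping correction is just additive accounting; here is a sketch of the idea. The overhead constant is assumed for illustration, not the measured value from the paper.

```python
# Sketch of the budget bookkeeping described above: part of each per-core
# budget is consumed by MemGuard itself (tracking refills, arming the
# replenishment interrupt), so the value we program must be inflated if
# the application is to truly receive its budget.
MEMGUARD_OVERHEAD = 40                 # refills/period lost to bookkeeping (assumed)

def programmed_budget(app_budget):
    """Budget to program into MemGuard so the app really gets app_budget."""
    return app_budget + MEMGUARD_OVERHEAD

def effective_budget(programmed):
    """What the application actually sees for a programmed budget."""
    return max(0, programmed - MEMGUARD_OVERHEAD)

print(programmed_budget(1250), effective_budget(1250))   # -> 1290 1210
```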
Once we have the necessary elements for the application, we focus on DRAM utilization. We first captured experimentally how much utilization the DDR controller can tolerate: every DDR controller has a theoretical limit in its manual, but we all know the actual stress it can take is usually less than that, so we measured it. Then, for each MemGuard and QoS setting, we ran a batch of experiments and used the profiler to calculate how much each setting impacted the utilization of the DRAM. We were always tracking, for a given MemGuard budget x and QoS setting y, what the impact on the DRAM would be, because we always wanted to make sure we were staying below the sustainable saturation level. Once we have all these metrics, we can run a full-system set of experiments and expect predictable outcomes.

Before I end my presentation, let me show some of the results we gathered when we ran everything concurrently on our system. First, here are the results from one particular benchmark. Our profiler gathered the raw data and we converted it into upper and lower cumulative memory transaction curves; you can see that in the image on top. On the y-axis we have the number of memory transactions, which is always increasing because these are cumulative transactions, and on the x-axis we have the execution time of the application; we run the profiler for the application's entire length. Like any application, it goes through segments where it makes many memory transactions, but there are also segments where it is mostly doing compute and the curve has no steep incline. By no steep incline I mean the curve stays flat: at that point in the application, even though time moved forward, the application made no accesses to the DRAM. It may have used the cache, but it generated no DRAM traffic.

Once we have that, we use our algorithm to make predictions of the execution time under different MemGuard settings; that is the graph below. On the y-axis I now have the execution time, how long the application took to complete, and on the x-axis is the knob we control, in this particular case the MemGuard budget. You can convert this budget from cache-line refills to a bandwidth value, and we do that in the paper. What you can see is that when we compare against measurements, our predictions are always over and never under, yet they are not overly pessimistic: they stay very close to the actual results we measured with the profiler. That matters too, because if you over-predict wildly, you might as well just run the one application on the entire system; with nothing else competing, your predictions would trivially hold, but you want to actually utilize the system to its full potential.
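The prediction step can be sketched as follows. This is a much-simplified model, not the paper's exact algorithm: walk the upper envelope and, in each sample interval, charge the larger of the observed time and the time needed to issue that interval's transactions at the regulated rate. The timestamps and envelope values below are made up.

```python
# Simplified sketch: predicting execution time under a MemGuard budget
# from the upper cumulative-transaction envelope. In each profiling
# interval the task needs dt seconds and dm DRAM transactions; regulation
# caps the rate at budget/period, so the interval cannot finish faster
# than dm / rate.
def predict_runtime(times, upper_env, budget, period=1e-3):
    """times: sample timestamps (s); upper_env: cumulative transactions;
    budget: transactions allowed per period; period: seconds (1 ms here)."""
    rate = budget / period             # regulated transactions per second
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        dm = upper_env[i] - upper_env[i - 1]
        total += max(dt, dm / rate)    # stall only where demand exceeds budget
    return total

times = [0.0, 0.01, 0.02, 0.03]        # made-up samples: one memory-heavy burst
env   = [0,   500,  9000, 9200]
print(predict_runtime(times, env, budget=100))   # -> 0.105 s, an over-approximation
```

Because the upper envelope bounds every observed run, a prediction built on it errs on the safe side, which matches the "always over, never under" behavior in the graph.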
Now let me show you some results from the full-system integration, with applications running simultaneously on all four cores. We use one core to execute the Roy benchmark, which uses both the accelerator and the CPU; on another core we run the Emser benchmark, a machine-learning benchmark with VGA input; and on the two remaining cores we run two memory bombs, which simply access the DRAM continuously.

After a lot of analysis and math, we fixed the QoS setting at a particular value, and at that value we knew the accelerator traffic contributed around 30% of the utilization. Having the profiler also helped us a lot here, because we realized that when the accelerator was running, another hardware component, the display control unit, also turned on and started making DRAM transactions. Without the profiler we would have had no way to know this; our results would not have matched our experiments and we would have been very confused. With the profiler we could see that the display control unit, even though nothing is plugged into its I/O port, makes transactions to the DRAM, contributing around 36% of the DDR utilization. All four cores together then get the remaining roughly 30%, which corresponds to a total MemGuard setting of around 5000 (the arithmetic is sketched after this part). We split that evenly, giving each CPU a budget of around 5000 divided by 4; that is the maximum MemGuard budget that can be assigned while still keeping saturation below the sustainable amount.

In this figure, on the x-axis we slowly increase the MemGuard setting; you can see it go from 492 to 819. That is the MemGuard knob we control, and as I mentioned earlier, we could go up to around 5000 divided by 4 without any problems. On the right y-axis we have how the execution times of both benchmarks, Emser and Roy, change, and on the left y-axis we have the utilization. The dotted lines are our predictions and the solid lines are what we observed with the profiler. You can see that our predictions track the actual results and sit slightly above them; as I mentioned earlier, in the real-time community it is better to be safe than to under-predict. But as soon as the black line, the predicted value of utilization, goes above 100%, all our predictions stop working. This makes sense: we are pushing the DDR beyond its capabilities and introducing a level of noise that cannot be tolerated; this particular DDR controller cannot manage the system well anymore, there is a lot of contention at the controller, and you can see a huge spike in the execution times of both applications as soon as you push beyond those limits.
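To make the budget split above concrete, here it is as a sketch. The utilization shares and the ~5000 total are the values quoted in the talk; the variable names and the translation into an even per-core split are just illustrative.

```python
# Schematic version of the budget split described above.
QOS_SHARE = 0.30                        # accelerator traffic pinned by the QoS setting
DCU_SHARE = 0.36                        # display control unit, discovered via the profiler
CPU_SHARE = 1.0 - QOS_SHARE - DCU_SHARE # what remains for the four cores ("around 30%")

TOTAL_CPU_BUDGET = 5000                 # MemGuard units corresponding to the CPU share
N_CORES = 4

per_core_budget = TOTAL_CPU_BUDGET / N_CORES   # ~1250: the max per-core budget that
                                               # keeps DDR saturation sustainable
print(f"CPU share: {CPU_SHARE:.0%}, per-core MemGuard budget: {per_core_budget:.0f}")
```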
Some concluding remarks. Each system has a theoretical and an experimentally sustainable bandwidth, or utilization level, at the memory controller, and it is extremely important to gather experimentally where these points lie; no matter how many applications are running, peripheral devices included, the system should not exceed this saturation threshold. Profiling is very important: it can help us understand the different stages of an application and give us insight into selecting settings for the new memory bandwidth controls that are coming. But we have to make sure the profiling tool is transparent and has fine granularity. We are currently working on a profiler that will also work on cloud workloads, and we hope to be able to present that at the next DevConf. I hope you found this presentation insightful, and thank you for coming.

Thank you for the great talk, Parul. I request everybody to carry on the discussion in the breakout room; I just posted the link in the chat. Thank you again.