Hi everyone, good evening. My name is Sunku, and I'm from Intel. My co-presenter couldn't make it, but I'll be covering for us both. Our topic: are you insured against your noisy neighbor? Here's the legal disclaimer. And let me start by thanking the team that made this work possible.

A brief agenda: we'll look at some of the common contentions in cloud environments, and then at the hardware technologies and software available today to avoid those contentions and increase determinism.

Let me start with the common contentions. In a cloud environment, one of the first directives is minimizing total cost of ownership, and more often than not that leads to oversubscription. That's where you trade off your quality-of-service requirements: SLAs, service availability, throughput, latency, scaling, and so on. Cloud and NFV deployments have many subtle differences, but in both, optimizing these hardware resources, especially CPU resources, leads to shared-resource contention. The shared resources are mostly the last level cache (LLC) of a CPU, memory bandwidth, and PCIe bandwidth. Today's orchestrators, like Kubernetes and OpenStack, don't have mechanisms to control these shared resources in a way that gives you more determinism. So let's dig in a little to see how you can achieve that.

To give you an idea of why the last level cache is important: across a CPU you might have multiple workloads running, but the last level cache is always shared among them. Sharing this resource can lead to up to 51% throughput degradation for communications workloads. If you take a simple workload, in this case bzip2 from the SPEC CPU benchmark, performance varies significantly with the amount of cache you have, up to a five-times difference. So any NFV or real-time deployment is very sensitive to the last level cache.

To address this, Intel introduced Resource Director Technology (RDT) on the CPU, an umbrella of technologies. We'll look at two of them today. The first is Cache Monitoring Technology (CMT): it gives you the hardware infrastructure for advanced telemetry on which application, or which core, is using how much cache, so you can identify misbehaving applications and move them around. The second is Cache Allocation Technology (CAT). Monitoring alone isn't always sufficient, so you now have a way to actually control the cache: to allocate a specific amount of cache to a given application. Once you identify your high-priority versus best-effort applications, you can give a dedicated section of cache to the high-priority application, ensuring that its performance is guaranteed. CAT provides the required hardware infrastructure to do that; there's a small sketch of the kernel interface for this at the end of this section.

So how do you achieve the determinism we're talking about? As I said, a lot of workloads are co-located in a cloud deployment, and while tuning your quality of service, in this case for cache, you're looking at throughput and latency measurements. One example of noisy-neighbor avoidance: in a content delivery network you might have two different media-streaming applications running.
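To make the allocation mechanics concrete, here is a minimal sketch of driving CAT through the Linux resctrl interface that recent kernels expose. The group name, bitmask, and PID are illustrative; it assumes resctrl is mounted at /sys/fs/resctrl and that the script runs as root:

```python
import os

RESCTRL = "/sys/fs/resctrl"  # assumes: mount -t resctrl resctrl /sys/fs/resctrl

def create_cat_group(name: str, l3_mask: str, cache_id: int = 0) -> str:
    """Create a resctrl control group pinned to a set of L3 cache ways.

    l3_mask is a contiguous hex bitmask of cache ways, e.g. "3" = two ways.
    """
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # The schemata file defines which L3 ways this group may fill.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:{cache_id}={l3_mask}\n")
    return group

def assign_pid(group: str, pid: int) -> None:
    """Move a process into the group; its LLC fills are then confined."""
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write(str(pid))

# Illustrative: confine a hypothetical noisy process (PID 1234) to two ways.
assign_pid(create_cat_group("noisy", "3"), 1234)
```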
Each is noisy to the other, because each application does a lot of memory reads and writes and heavy cache utilization. So how would you detect, control, and avoid workloads like this?

We did some analysis with OPNFV VSPerf, which we got an introduction to in the previous presentation. VSPerf is a fully automated test suite that can deploy your VNF, NFVI, and virtual switch; in this case we used Open vSwitch. You can define and implement your test cases and scale the CPUs for your VNFs. VSPerf supports a lot of traffic generators; in this case we used an Ixia traffic generator. A simple case is an application running in a VNF, your virtual switch, and a dedicated 10 GbE NIC.

As the noisy neighbor we used Spirent CloudStress. A noisy neighbor, in this example, is something that does a lot of cache thrashing: lots of reads and writes. CloudStress is a nice web-managed tool: you can deploy it as a VM and configure it to emulate a particular set of workloads, in this case heavy memory reads and writes. Depending on how you tune it, it can perform like a firewall or a router, and it can stress your compute, memory, or storage. We ran it alongside VSPerf: you deploy this VM next to your workload under test so that it acts as the noisy neighbor.

To get the metrics we used collectd, as part of the OPNFV Barometer project. collectd has been around forever; it's the most widely used statistics-collection daemon, with multiple read and write plugins that give you the required data on CPU, last level cache, and so on. The collection interval is configurable; in our case we used a one-second interval, which means that every second you get the cache utilization on a per-workload, per-application basis. We used the intel_rdt plugin, which gives you this advanced telemetry of per-application cache statistics from the CPU.

Here's the test setup: a traffic generator sending in 2,000 flows to an Intel platform with 10 GbE NICs; OVS with DPDK deployed by VSPerf; the VM under test, an L2-forwarding TestPMD application; and two noisy-neighbor VMs running CloudStress.

Now look at the performance. The baseline is 4.2 million packets per second; as soon as the noisy-neighbor applications are running, performance drops by about 33 percent. When you see something like this, a lot of folks start debugging the usual suspects: do I have enough CPU, is my application well tuned, is there enough memory, kernel tracing and whatnot. Very rarely do we look at something like the last level cache. So let me give you an example of how to look at the last level cache to debug performance challenges like this.

This is a busy slide, and one of the important ones, so let's start here. The first step is constructing your cache profile under idle conditions, meaning with no noisy neighbor and no other applications running. We first characterize the cache footprint of each component, such as the VM under test, in this case TestPMD.
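The collectd intel_rdt plugin ultimately surfaces the occupancy counters that CMT provides in hardware. As a rough sketch of the per-second sampling described above, here is what reading those counters through the kernel's resctrl monitoring files can look like; the paths assume a CMT-capable CPU with resctrl mounted, and per-group mon_data directories exist alongside this root-group one:

```python
import time
from pathlib import Path

# llc_occupancy reports bytes of L3 currently attributed to the group.
MON = Path("/sys/fs/resctrl/mon_data")

def sample_llc_occupancy() -> dict:
    """Read LLC occupancy (bytes) for every L3 cache domain."""
    return {d.name: int((d / "llc_occupancy").read_text())
            for d in MON.glob("mon_L3_*")}

while True:  # one-second collection interval, as in the setup above
    for domain, occupancy in sample_llc_occupancy().items():
        print(f"{domain}: {occupancy / 2**20:.1f} MiB")
    time.sleep(1)
```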
We see that when the VM is delivering maximum performance, the vSwitch daemon is using less than 2.5 MB. The platform we used is the Broadwell generation, Intel Xeon E5 v4; each PMD uses about 12.5 MB, and the forwarding VM about 2.5 MB. So under clean conditions you get your cache profile, which lets you understand how much cache your workload actually needs.

Now let's look at the packet flow on a simple platform. Packets come in from your NIC and are copied to memory; the vSwitch classifies the destination; and the packets are copied into the memory space of the VM you're forwarding to. So you have these memory copies going on. But what if you take the Data Direct I/O (DDIO) path? The platform has supported DDIO for a long time now; with it, packets are copied from the NIC directly into the last level cache.

What cache allocation technology provides is the ability to divide the cache and allocate a section of it to your VMs, your containers, your applications. So we did some performance studies. First we ensured that the cache ways DDIO targets are given to OVS with DPDK, because that's where the poll-mode drivers are doing the heavy lifting of copying tons of packets; so the DDIO cache ways go to the PMDs. Then there are the VMs. The CloudStress noisy neighbor, left unconstrained, can take up to 52.5 MB of the 55 MB LLC, more than 95 percent of your last level cache, and that's the main reason you saw the 33 percent performance drop. Instead, we constrained it to just 2.5 MB. Each platform has a minimum segment into which you can divide the cache, called a cache way, and on this platform generation a cache way is 2.5 MB, the least you can associate with an application. So each noisy-neighbor VM gets 2.5 MB, and we gave the vSwitch infrastructure, basically the PMDs and the vSwitch daemon, the sum of the cache our profile said they need.

Cache allocation technology also allows you to overlap cache: here the VMs are isolated in their own cache ways, while the vSwitch overlaps its cache with the VMs. What we found was that, rather than fully isolating the vSwitch and the VMs, sharing the cache gave a bit of latency improvement. A lot of these details are in the PDF published on Intel Network Builders. Overlapping the vSwitch cache with the VMs gave a pretty good performance improvement. The main idea here is that there are a lot of limitations and combinations in how you can allocate your cache, overlapping versus isolated, and from a cloud perspective you can't log into every platform and hand-tune the cache on a per-NUMA-node basis.
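As a sketch of what the isolated-versus-overlapped layouts look like as cache-way bitmasks, using the numbers from the talk (2.5 MB per way, 55 MB LLC, so 22 ways); the specific way assignments and group names here are illustrative:

```python
def ways_mask(start: int, count: int) -> str:
    """Hex bitmask selecting `count` contiguous cache ways from `start`."""
    return format(((1 << count) - 1) << start, "x")

TOTAL_WAYS = 22  # 55 MB LLC / 2.5 MB per way, per the numbers above

groups = {
    "noisy1":  ways_mask(0, 1),               # isolated: one way, 2.5 MB
    "noisy2":  ways_mask(1, 1),               # isolated: one way, 2.5 MB
    "vm":      ways_mask(2, 1),               # forwarding VM: one way
    "vswitch": ways_mask(2, TOTAL_WAYS - 2),  # PMDs + daemon, overlapping
                                              # the VM's way (shared layout)
}

# Each entry is one line of a resctrl schemata file for cache domain 0.
for name, mask in groups.items():
    print(f"{name}: L3:0={mask}")
```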
So instead we introduced the Resource Management Daemon (RMD); this software was open-sourced in 2017. In a cloud network you traditionally have a long control loop: your metrics are sent to a central controller, and based on policy the relevant resource-control decisions are applied. RMD instead controls these latency-sensitive, real-time-sensitive resources, in this case the last level cache, on the node itself: it takes the monitoring information and, based on the policy you set, enforces decisions within the local node. Why do we care about this? Because it runs on individual nodes, and latency matters: one second is sometimes too long for something like the last level cache. At the millisecond level you can send in your policy over a REST API (sketched at the end of this section), and based on that policy RMD decides and acts on how to control your cache. The architecture is very simple: you send in a policy, and a process writes the corresponding configuration to the resctrl filesystem.

With the VSPerf infrastructure we integrated both collectd and RMD. We send the policy to RMD at runtime, saying "I want to divide my cache this way", the VMs-versus-vSwitch example we just saw; RMD writes that policy into the resctrl filesystem, allocating the cache accordingly, and we rerun the performance test. That's the idea. In an orchestrated environment we also need an automated way to understand which workloads are running; that's coming up next.

Now, looking at the performance with the policy set through RMD, there's a good jump: more than about 40 percent compared to what we just saw in the presence of the noisy neighbor. That's because we constrained the noisy neighbor to just one cache way, about 2.5 MB, so it no longer touches the cache that belongs to your VM under test. In this case the cache policy simply said: noisy neighbor, you get just 2.5 MB. At the same time, RMD lets you scale the amount of cache you give a workload: instead of 2.5 MB, if you have cache left over, you can increase it to 3 MB or 10 MB, whatever size you choose. That's the benefit of a policy-based agent like RMD: based on the performance you observe, you can scale the amount of cache you've allocated.

I'm pretty much done, so the takeaways at a high level: noisy-neighbor effects need not be about your CPU or memory; there are shared resources on the platform you have to care about, like the last level cache, and RDT helps you do LLC quality-of-service control, with RMD on top. On the OPNFV VSPerf page there are quite a few examples and test cases, and we've recently done a demo where you can watch, live and in real time, how changing a last-level-cache association policy can have a big impact on performance. So you have a lot of resources there. And in terms of orchestrators, like we were discussing: how would RMD understand the VMs, how would RMD work with Kubernetes? We have a set of blueprints, and there's a lot of discussion going on online, so feel free to review, comment, and give us feedback, so that you can leverage this in your NFV infrastructure and understand the important role the LLC can play. With that, I'm pretty much done. Any questions?

Q: Do you assign cache to a process or to a CPU core?
A: You can do both. Kernel 4.14 and above supports assigning cache per process, so your process can move across cores if you want to, and the cache is still guaranteed for that process.

Any other questions? Thank you.
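Tying together the runtime-policy flow and the per-process assignment mentioned in that last answer, here is a rough sketch of pushing a per-process cache policy to RMD over REST. The endpoint, port, and payload fields are assumptions based on the open-source RMD project rather than anything shown in the talk, so check the project's API documentation before relying on them:

```python
import json
import urllib.request

RMD_URL = "http://127.0.0.1:8888/v1/workloads"  # assumed default endpoint

# Assumed payload shape: confine a process to a single cache way (~2.5 MB
# on the Broadwell platform above), mirroring the noisy-neighbor policy.
policy = {
    "task_ids": ["1234"],           # hypothetical noisy-neighbor PID
    "cache": {"max": 1, "min": 1},  # cache ways to allocate
}

req = urllib.request.Request(
    RMD_URL,
    data=json.dumps(policy).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```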