Hi, everyone. My name is Maryam Tahhan. I'm a network software engineer at Intel, and today I'm going to be co-presenting with Ifat Afek from Nokia CloudBand. She's a system architect there, and she's also the PTL for the Vitrage project. Today, we're going to be presenting and showcasing the noisy neighbor use case and its solution. I'm going to introduce the use case, then we're going to dive into a demo, and then into the technologies that comprise the solution that resolved the use case.

So I suppose the question we're trying to answer is: in an OpenStack environment, how can we detect and correct a noisy neighbor without migrating a workload from the compute node where it is in operation? And like all good things in life, the answer is a three-parter.

From a detection perspective, we're going to leverage collectd, which is a system statistics collection daemon, and Intel Resource Director Technology, or RDT. Intel RDT is a set of technologies that allows you to monitor and manage shared resources on the platform, like last-level cache occupancy and memory bandwidth, for applications sharing the same socket. Using a combination of these, we're going to detect a noisy neighbor running on a compute node. Then we move into the notification phase: we're going to propagate a notification from collectd to Vitrage, where the problem and its impact are going to be root-caused, visualized, and exported. And finally, we move into the remediation phase, the carrying out of the corrective action. For that, we're going to use a combination of Mistral and Intel RDT's allocation technologies.

Before I dive into the demo, I want to give you some context on what exactly we're going to be showing, and a little background that will help fill some of the gaps so we can understand what we're seeing. The demo consists of a control node and a compute node. On the control node we have Vitrage, so we're going to be able to see the entity graph and all of the elements in operation on the compute node, and Mistral is also running on the controller. On the compute side, we have three virtual machines. The first is a video server VM that streams a video to a video client VM. The third VM on the compute node is where we're going to run a stress application. Each of these virtual machines is pinned to an isolated set of cores, and we're going to monitor the last-level cache utilization for each of those cores, leveraging collectd and Intel RDT's user space library, libpqos.

What's key to point out here is that the two applications we're really interested in are the video server and the video client. We're going to show how the quality of the video degrades when we run the stress application, and how it comes back to normal once the corrective action takes place. Under normal operating conditions, we don't see anything. But when we start running the stress application within the VM and on the compute host, we're going to see a significant reduction in the last-level cache utilization for both the video server VM and the video client VM. This is going to cause a notification to be generated from collectd to Vitrage.
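As a rough illustration of the monitoring setup, the collectd intel_rdt plugin can be pointed at each pinned core group. This is a minimal sketch, not the demo's actual configuration: only the video server's cores (37 to 40, mentioned later in the demo) are from the talk, and the other two core lists are hypothetical:

    LoadPlugin intel_rdt
    <Plugin intel_rdt>
      # One RDT monitoring group per pinned core set: video server,
      # video client, and stress VM (the last two lists are hypothetical)
      Cores "37-40" "41-44" "45-48"
    </Plugin>

With a grouping like this, collectd reports last-level cache occupancy and memory bandwidth per core group, which is what lets us attribute cache usage to individual VMs.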
Vitrage is also going to notify Mistral: once it has done its root cause analysis, it notifies Mistral of an alarm on that VM, and Mistral triggers the corrective action, which again takes advantage of the user space library available on the compute node. I just want to explain what that corrective action is at a very high level. Intel RDT uses a concept called class of service. This concept can be used to isolate applications that you're really interested in and prioritize their last-level cache utilization relative to other applications running on the same processor. So we're simply going to reassign a higher class of service to the two VMs we're interested in (I'll show a sketch of the commands involved right after the demo walkthrough), and we're going to see normal video operation resume and the alarm get cleared on the Vitrage side.

So I'm just going to switch screens here. To give you the lay of the land: we're going to see the host command line here, where we'll start the collectd service; the stress VM command line here, where we'll start the stress application in a while; here, the video stream being fed to the VLC client VM; the Vitrage alarms when they occur; Grafana, which will show us the last-level cache utilization for the various components of the demo; and the Vitrage entity graph, which will show us each of the VMs, the cores they're associated with, and any alarms that get triggered or generated as a result of noisy neighbors.

We can see the video has already started, and the quality is pretty normal there. We're going to start collectd. There's also a link to this video where you can access it on YouTube. If I zoom in now, I think we'll lose sight of the rest of the screen, but I'll make sure to share it with you, so you'll just have to trust me on some of the command lines, I apologize. So on the host here, we're literally just starting the collectd service, and what we should see is Grafana starting to display the demo metrics. Apologies, one moment... Oh, sorry, folks. There we go. OK, I suppose that's one hiccup we can now get over and restart the demo. Apologies.

So on the top left-hand side here, we see the host command line. Moving clockwise, we see the stress guest command line, then the video stream being fed to the VLC client VM. The Vitrage alarm screen is displayed down here. In Grafana, we can see the last-level cache utilization, as I mentioned. And in the Vitrage entity graph view, we can see each of the virtual machines, the Ubuntu stress VM, the Ubuntu VLC client VM, and the Ubuntu VLC server, and the cores they're attached to. When an alarm is generated, we'll see it in this view as well.

So the video starts, we start the collectd service, and in a moment we should see some of the metrics coming into Grafana. Actually, if I just pause here for a second: on the bare metal host, a class of service of 0 is assigned to all of the cores. Class of service 0 means no class of service has been configured, so cache is allocated to applications on a first-come, first-served basis. In the Grafana view here, the VLC server VM's last-level cache usage is in green. The stress-ng VM's last-level cache usage is there too, but nothing is running in it at the moment, so you can't really see any metrics; it's on the zero line. And the host's last-level cache usage is in blue.
So from the Grafana side, we can see the last-level cache is mainly used by the VLC server virtual machine, and the video quality is relatively OK. Now we're going to start stress on the bare metal host and stress in the VM. Almost immediately, we see an alarm getting added to the entity graph view. And if I let it run for a second more, because the refresh rate is a little bit different on the two alarm views, what we notice is that collectd actually sends only one alarm to Vitrage, and that's to do with the core usage. What's really important to note is that Vitrage raises a second alarm: it deduces that there's a noisy neighbor situation affecting the VM that's attached to those cores. So collectd only sends one alarm; Vitrage generates the second.

Now, for the video quality, you need to remember that with video streams, some frames are buffered for a period of time. So what we're going to see is the video degrade and disappear while the stress application is running, and when Vitrage notifies Mistral to carry out the corrective action, everything should resume to normal. Again, it takes a couple of seconds to buffer up enough frames for the video stream. On the Grafana view, what's interesting to note is that the VLC server's last-level cache usage has dropped significantly, the bare metal host's last-level cache usage has spiked up significantly, and we can even start to see some of the stress VM's cache utilization increasing. Once the buffered frames are gone, the video starts getting a little choppy and disappears. Now, Vitrage triggered the Mistral workflow immediately, and the alarm is cleared when the usage goes back to a normal level for both VMs. So we see a small bit of a delay there, and now we're going to see the video come back into normal operation, even though stress is still running on the host and stress is still running in the guest VM. So it's coming... and we're back, and stress is still running.

On the Grafana side, the last-level cache usage for the VLC server has gone back up. The host is now limited in how much cache it can use, so that usage has gone down. And if we just stop the stress application running on the host and check the class of service for the VMs, we see that for the video server, which is pinned to cores 37 to 40, the class of service is set to 1. That's a higher class of service than what is configured for the stress-ng VM, which has a class of service of 3. And I'll show you in a moment the class of service for the client VM, which is set to 2. So we've pretty much isolated the two applications we were interested in and prioritized their last-level cache utilization, without having to migrate any workloads off the compute node.
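To make that concrete, here is a rough sketch of what the noisy neighbor and the corrective action look like at the command line, using stress-ng and the pqos utility from Intel's intel-cmt-cat package. The cache way masks and the core lists for the client and stress VMs are illustrative; only the video server's cores (37 to 40) and the three class of service IDs were called out in the demo:

    # Hypothetical noisy neighbor: matrix multiplication thrashing the LLC
    stress-ng --matrix 0 --matrix-size 2048 --timeout 600s

    # Show the current allocation state (all cores start in COS 0)
    pqos -s

    # Define LLC way masks for three classes of service (masks illustrative)
    pqos -e "llc:1=0x0ff0;llc:2=0x000c;llc:3=0x0003"

    # Associate the video server (COS 1), client (COS 2) and stress VM (COS 3)
    # with their pinned cores; the last two core lists are hypothetical
    pqos -a "llc:1=37-40"
    pqos -a "llc:2=41-44"
    pqos -a "llc:3=45-48"

The point of the -e/-a pair is that once the video server and client own dedicated cache ways, the stress workload can no longer evict their working set, which is why the video recovers even while stress keeps running.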
So now, to dive a little into the technologies that comprise the solution: I'm going to focus on the monitoring side, and then Ifat is going to talk us through the Vitrage and Mistral workflow.

collectd is a system statistics daemon that is more than ten years old. It has quite a modular architecture and is easily extensible via plugins; basically, outside of the daemon itself, everything is a plugin. You have plugins that read telemetry or events off the platform, and you have plugins that publish those events and telemetry to various endpoints. It also supports thresholding and notification: if you're interested in a particular value, you can set min and max warning levels, min and max failure levels, and you can even have notifications clear those warnings if values fall back into acceptable ranges. It's platform independent as well. Typically you have the main collectd daemon, your input and output plugins, and some plugins that work both ways; the network plugin, for example, allows you to configure collectd in a client-server mode, so if you want to aggregate metrics off a cluster of machines, you can do that.

In the Barometer project, though, we've been particularly focused on the read plugins rather than the plugins that go both ways. We've been really focused on enabling the relevant metrics and events for capacity planning, trending, and operational status of the NFVI, and on being able to export those metrics and events to the VIM and to MANO. So we've been extending collectd with a number of plugins, such as Gnocchi, AODH, and Ceilometer plugins, as well as the relevant plugins to monitor the subsystems we're interested in.

One of the technologies we think is key for placement decisions, for adjustments to scheduling policy, and for resource awareness is Intel RDT. As I mentioned, Intel RDT is actually a set of technologies, not just one. It's used to monitor and control the usage of shared resources, in particular the last-level cache and memory bandwidth, for processes sharing the same processor. The technologies fall into two subcategories. The monitoring technologies are CMT, Cache Monitoring Technology, and MBM, Memory Bandwidth Monitoring; in the Barometer project in OPNFV, we've enabled a collectd plugin for Intel RDT that leverages those two monitoring technologies and relays the RDT stats to any endpoint collectd supports today. The allocation technologies are CAT, Cache Allocation Technology, and CDP, Code and Data Prioritization, which allow you to control through software where data is allocated in your cache, prioritize applications, and isolate them from other applications running on the same processor. What we saw in the demo was actually just CMT and CAT: CMT to relay the metrics, and CAT to carry out the corrective action by reassigning the class of service for the VMs we're interested in.

From an operational perspective within the demo, the collectd daemon runs at a particular interval and issues a read to the collectd intel_rdt plugin. The plugin retrieves the metrics and dispatches the values it collects back to the collectd daemon. Those values would normally be published to write plugins, but in this case we've configured a threshold for the two VMs we're interested in monitoring. Because the threshold was hit when we activated the noisy neighbor, a notification was generated and sent to the collectd Vitrage plugin, which then propagated the notification on to Vitrage. I just want to point out that the collectd Vitrage plugin was developed by the Vitrage team and actually lives in their GitHub; the collectd intel_rdt plugin was developed by the Barometer team and actually lives in the collectd GitHub; and the Intel RDT user space library is available through GitHub as well. All the links will be available in the presentation afterwards.
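As a rough sketch of that thresholding step, collectd's threshold plugin can be configured against the intel_rdt plugin's metrics. This assumes LLC occupancy is dispatched under the bytes type; both that and the numeric floor are illustrative, not the demo's actual configuration:

    LoadPlugin threshold
    <Plugin threshold>
      <Plugin "intel_rdt">
        <Type "bytes">
          # Generate a warning notification when LLC occupancy for a
          # monitored core group falls below this floor (value illustrative)
          WarningMin 30000000
          # Keep re-notifying while the value stays out of range
          Persist true
        </Type>
      </Plugin>
    </Plugin>

When a value crosses the threshold, collectd generates a notification internally, and any notification-capable plugin, in our case the Vitrage plugin, can pick it up and forward it.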
So now I'm going to hand you over to Ifat to talk us through the Vitrage section.

Hi. So Vitrage is an official OpenStack service for root cause analysis. It is used to organize, analyze, and expand the OpenStack events and alarms. A cloud operator who has some fault in the cloud may see a very large list of alarms, and it can be hard to understand what the root cause of those alarms is; this is where Vitrage can help. Another role of Vitrage is to report alarms on problems that are not directly monitored in the system. For example, in case of a physical NIC failure, Vitrage can identify the VMs that are affected by this failure and report that they are currently unreachable. Vitrage provides a holistic and complete view of the system, including the physical layer, the virtual layer, and the application layer, and you can clearly see the relations between these layers and the effect they have on one another.

I'll talk a bit about the Vitrage architecture. Vitrage collects information from different data sources. Some of them are OpenStack components like Nova, Cinder, Neutron, Heat, and also AODH, the Telemetry alarming service. Others are external monitors: Zabbix, Nagios, and collectd. All this information is collected and inserted into a topology graph, the entity graph. In the graph, you can see all the resources and alarms that Vitrage collected in the system and the relationships between them. When the graph is modified, the Vitrage evaluator checks if there are actions to be taken. An action could be to raise a new alarm, to mark a causal relationship between existing alarms, or to modify the state of an object. The logic and the rules for when to execute these actions are defined in the templates, which I will describe soon. When Vitrage performs an action, like raising an alarm or modifying the state of an object, it can notify external systems, for example Nova, or send SNMP alarms. In addition, Vitrage has an API, a command line interface (CLI), and a user interface that is part of the Horizon dashboard.

Maryam showed you the slide with the overall flow of the demo, and I will now drill down into the Vitrage part. So Vitrage received an alarm from collectd about the problem on the core. It added the alarm to the entity graph and checked if there was a matching template. The template that was found said: if there is an alarm on a core, and there is an instance that is pinned to this core, you need to raise an alarm on this instance as well, saying that it suffers from performance degradation because of its noisy neighbor. Vitrage raised an alarm on the instance and marked the causal relationship between the two alarms. Next, Vitrage notified Mistral, the OpenStack workflow engine, about the problem on the instance, and Mistral executed a workflow with corrective actions.

The Vitrage template is where you define these rules. A template has three sections. The first two are the entities and the relationships, which are the building blocks, and the scenarios are where you actually define the rules. Each scenario has a condition and one or more actions. In the demo example, one condition was: if there is an alarm on the core, then the action is to modify the state of the core. Another condition said: if there is an alarm on the core and there is an instance pinned to this core, then raise an alarm on the instance and modify the state of the instance.
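For a feel of what such a rule looks like, here is a minimal sketch in the Vitrage template format. The alarm name, the core resource type, the relationship label, and the workflow name are hypothetical stand-ins; the demo's actual template is not shown in the talk (and it also marks the causal relationship between the two alarms, which this sketch omits):

    metadata:
      name: noisy_neighbor_rca
      description: deduce an instance alarm from a core-level LLC alarm
    definitions:
      entities:
        - entity:
            category: ALARM
            type: collectd
            name: llc_occupancy_low        # hypothetical alarm name
            template_id: core_alarm
        - entity:
            category: RESOURCE
            type: core                     # hypothetical resource type
            template_id: core
        - entity:
            category: RESOURCE
            type: nova.instance
            template_id: instance
      relationships:
        - relationship:
            source: core_alarm
            target: core
            relationship_type: on
            template_id: alarm_on_core
        - relationship:
            source: instance
            target: core
            relationship_type: pinned_to   # hypothetical edge label
            template_id: instance_pinned_to_core
    scenarios:
      - scenario:
          condition: alarm_on_core and instance_pinned_to_core
          actions:
            - action:
                action_type: raise_alarm
                properties:
                  alarm_name: instance_performance_degraded
                  severity: critical
                action_target:
                  target: instance
            - action:
                action_type: execute_mistral
                properties:
                  workflow: noisy_neighbor_workflow   # hypothetical name
                  input:
                    host_name: demo_compute_1         # hypothetical input

And on the Mistral side, a corrective workflow along these lines could reassign the class of service on the compute node over SSH; again, this is a sketch under assumptions, not the demo's actual workflow:

    ---
    version: '2.0'
    noisy_neighbor_workflow:      # hypothetical workflow name
      input:
        - host_name
      tasks:
        reassign_cos:
          # Run pqos on the compute node to move the affected VMs into
          # higher-priority classes of service (core lists illustrative)
          action: std.ssh
          input:
            host: <% $.host_name %>
            username: root
            cmd: pqos -a "llc:1=37-40" && pqos -a "llc:2=41-44"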
This is how Vitrage helps identify that there is a problem on the instance, while collectd only reported a problem on the core. As a next phase, we are currently working on machine learning algorithms to automate the process of generating these templates, so we can examine historical alarms and generate more templates and more causal relationships. If you want to learn more or see live demos, you are welcome at the Nokia booth. Thank you. Back to you, Maryam.

Yeah, so just to summarize, we've shown you how to address and correct a noisy neighbor by leveraging a number of technologies: from a detection perspective, collectd and Intel RDT; from a notification perspective, Vitrage, and in particular how the problem and its impact were root-caused, visualized, and exported there. And finally, we showed you how we carried out remediation, or corrective actions, using a combination of Mistral and Intel RDT, without migrating any VMs off the compute node where they were in operation. So if you're interested in ensuring that the relevant platform metrics for supporting capacity planning, trending, and operational status are available through a standard, industry-adopted interface, and if you're interested in demonstrating how platform technologies can be monitored, consumed, and root-caused, how alarms can be deduced, and how state can be acted on in real time, then I invite you to come and join us in both the Vitrage project and the Barometer project in OPNFV. I'll also just draw your attention to a couple of other presentations happening this week around Intel RDT and some of the collectd work we're doing, and there are some useful links at the end of the presentation. So we're happy to take any questions; I'd invite you to use the microphones.

Hi, this is Gurprit from Spirent Communications. Hey, Maryam. Question: I assume Vitrage is, like you mentioned, one of the OPNFV projects?

Hi. Vitrage is an OpenStack project that is collaborating with OPNFV, with the Barometer and Doctor projects.

Okay, so it's open source then?

Yeah, fully open source.

Okay. And based on this demo, I assume we're making the assumption that the last-level cache is the only factor causing the noisy neighbor impact, and the other factors have already been eliminated?

Yes. Just for the purpose of the demo, we wanted to showcase a particular use case and how you could do local corrective action for the noisy neighbor use case where the last-level cache is the only issue, yeah.

Hey, just a quick question, and maybe a bit of a side question, but have you looked at using the same technology, including collectd and RDT, to actually monitor the control plane itself, as opposed to the applications? I've been looking, for example, for collectd plugins for OpenStack services, and they're pretty hard to find.

Yeah, so we haven't really been looking at monitoring OpenStack services, because a lot of them publish their own metrics directly to the bus, and our focus has really been very specifically the NFVI, so where the VNFs are in operation. If we see a need to collaborate on that in the future, we can certainly consider it, especially from a VNF event streaming perspective.

Yeah, sort of an objective observer of the sort of metrics that you talked about, which would be invisible to the services themselves, might be something interesting to consider. Thank you.

Hi, actually it was the same question.
Okay, so yeah, from the operator point of view, we all need an end-to-end RCA, so we would really appreciate it if it could accept any kind of KPIs from the VNF or the VNFM, so that it can correlate across the whole stack. That would be perfect.

Yeah, so actually in collectd we've been enabling quite a number of plugins, not just RDT. We're looking at things from an IPMI perspective, and from a hypervisor perspective we're supporting libvirt. From the platform perspective, we speak to certain monitoring units on the physical platform to retrieve some of those stats, and we even monitor machine check exceptions and various other components. I would really invite you to visit the Barometer wiki; we keep a list of all of the statistics, and events as well, that we maintain for each of the plugins we've enabled.

Okay, thank you.

Just one quick question: I didn't get what the corrective action in the demo was.

The corrective action was... so normally, when the VMs were in operation, the class of service was configured to zero, which means there was no class of service configured for any of the applications running on the processor. In that case, the cache is utilized on a first-come, first-served basis. So when we ran stress-ng doing matrix multiplication, it just started consuming all of the cache lines. Intel RDT actually allows you to allocate a new class of service for the application you're interested in, which allows you to prioritize and isolate a section of cache for it, so that nothing else running can consume that cache. That was the corrective action.

There was a presentation earlier from the OpenStack instances HA group, where they were looking at event correlation, and they are also invoking Mistral to do remediation. I just wondered if you've done any work with them or spoken to them?

I wasn't aware of this presentation, but I will check it out. It's interesting.

Yeah, it was earlier today. Thank you.

Okay. Okay, thank you very much, folks. Thank you for your time.