Hi everyone, my name is Helina, I work at Intel and I'm based out of Shannon in Ireland. I'm here to describe some collaboration work done between Intel and Nokia to integrate collectd and Vitrage, and then show an application of this work in noisy neighbour fault detection and correction. Today I'm going to give a bit of background on why the work was done, then some information on collectd and Vitrage, the technologies used, and on Intel Resource Director Technology, which is the technology we're going to use to detect the noisy neighbour fault. Then I'll describe how it all integrates together, and then hopefully run the demo and describe what's happening.

The reason we decided to integrate collectd and Vitrage was to enhance service assurance on a virtualised platform. Service assurance is essentially providing a good quality of service for predefined services, so detecting and correcting faults, degradations and violations on a platform will enhance service assurance, and it made sense to do this integration. collectd is the system statistics collection daemon that is used to collect the statistics from the platform. It can be used on a virtualised platform, which is why we chose collectd. It's an open source project. To collect the statistics it uses plugins, so you can write your own plugins, most of which are written in C. They read the statistics from the platform and send them to collectd, which can then propagate them to other services. collectd also supports thresholding and notifications: there's a threshold plugin where you can set thresholds for the statistics read by the other plugins, and if a threshold is broken, a notification is triggered and can be sent to other services. The other services can then receive these notifications by registering callback functions for them.
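As a concrete illustration of the thresholding just described, a collectd configuration fragment like the following (using the standard load plugin purely as an example; the plugin, type and value are illustrative, not the ones from this demo) makes collectd emit a notification whenever the statistic crosses the limit:

```
# collectd.conf fragment: fire a WARNING notification when system load
# read by the "load" plugin rises above 4.0.
LoadPlugin load
LoadPlugin threshold

<Plugin "threshold">
  <Plugin "load">
    <Type "load">
      WarningMax 4.0
      # Persist false => notify once per threshold crossing, not on every read
      Persist false
    </Type>
  </Plugin>
</Plugin>
```

Any plugin that registers a notification callback will then be handed this notification and can forward it on.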
To communicate with OpenStack services, collectd has a Python plugin, because OpenStack services are generally written in Python, so the Python plugin needs to be configured to load the OpenStack service plugin. Vitrage is the technology we're going to use as the fault management system. It is OpenStack's root cause analysis service, so it not only provides event and alarm functionality, it can also determine the root cause of a problem. Its view of the system is detailed enough that it can even identify the core that any VM is pinned to, so it can determine the root cause and also deduce a fault before it has actually been detected directly.

Intel Resource Director Technology (RDT) is a set of technologies used to monitor and control resources like last level cache and memory bandwidth. The monitoring technologies include Cache Monitoring Technology and Memory Bandwidth Monitoring; Cache Monitoring Technology is what will be used to monitor the last level cache. To control the monitored cache and memory bandwidth, we then use the allocation technologies: Cache Allocation Technology and Code and Data Prioritization. Cache Allocation Technology also has a useful feature where you can define a class of service (COS). You define the class of service for your VM and assign it the amount of cache it is allowed to use, so a higher priority process gets a larger cache allocation and a lower priority process a smaller one, and there will be no degradation in performance of the higher priority process.

So how do all of the technologies integrate together? collectd will be running on your virtualised platform. It will have the Intel RDT plugin configured, it will have the Python plugin configured to load the collectd-vitrage plugin, and it will have the threshold plugin configured with a threshold set for the RDT plugin, which will be monitoring the last level cache of the VMs.
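A sketch of that combined collectd configuration might look as follows. The core lists, module path, module name and threshold value are assumptions for illustration (the real collectd-vitrage plugin and the demo's actual values may differ); the plugin and option names (`intel_rdt`, `Cores`, `ModulePath`, `Import`, `WarningMin`) are standard collectd directives:

```
# Monitor last level cache (and memory bandwidth) per core group with Intel RDT.
LoadPlugin intel_rdt
<Plugin "intel_rdt">
  # Hypothetical pinning: media player VM on cores 0-3, noisy neighbour on 4-7.
  Cores "0-3" "4-7"
</Plugin>

# Load the collectd-vitrage plugin through the Python plugin.
<LoadPlugin python>
  Globals true
</LoadPlugin>
<Plugin python>
  ModulePath "/opt/collectd-vitrage"   # hypothetical install path
  Import "collectd_vitrage"            # hypothetical module name
</Plugin>

# Notify when the LLC occupancy reported by intel_rdt drops below a floor.
LoadPlugin threshold
<Plugin "threshold">
  <Plugin "intel_rdt">
    <Type "bytes">
      WarningMin 25000000   # illustrative value, in bytes of LLC occupancy
      Persist true
    </Type>
  </Plugin>
</Plugin>
```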
So if the last level cache goes below the threshold set in the threshold plugin, a notification will be sent to the collectd-vitrage plugin, which will then post it to Vitrage itself.

Now I'm going to run the demo and try to explain what is happening during noisy neighbour fault detection and correction. Here we have the bare metal host command line. On the top right is going to be your noisy neighbour, and below it, to demonstrate how effective this is, we have a video stream that will be run on one of the VMs to show when the fault is occurring: the video stream will start to become interrupted and stop working properly. We also have a Vitrage entity graph that will show us the deployment of the entire system. Then the Grafana display down on the bottom left has the last level cache values being read by collectd, which is running on the bare metal host. Then there's a display here for the alarms that will be triggered when the noisy neighbour is started.

As you can see, collectd has been started and it's going to read the data from the system. Checking the COS, the class of service hasn't been assigned yet for either the media player or the noisy neighbour. So at the moment, when the media player is running it's going to get all the cache, because the noisy neighbour isn't running anything. The video stream will be clear, and you can see the last level cache values being read by collectd down on the bottom left. As you can see, the video stream is running fine, collectd is running, and there's no COS, class of service, defined for either of the VMs. Down on the bottom left, the top line is the last level cache values for the media player, and the line across the bottom is the values coming from the noisy neighbour, which doesn't have anything running yet. We're now going to start to thrash the cache on the noisy neighbour.
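To make the notification path concrete, here is a minimal Python sketch of what a collectd-to-Vitrage bridge does: it reshapes a collectd notification into an event payload for Vitrage's event API. The field names (`type`, `details`, and so on) follow the general shape of Vitrage events but are illustrative, not the exact schema of the real collectd-vitrage plugin:

```python
from datetime import datetime, timezone


def notification_to_vitrage_event(notification):
    """Turn a collectd-style notification (given here as a plain dict)
    into a Vitrage-style event payload.

    Field names are illustrative; the real collectd-vitrage plugin
    may use a different schema.
    """
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "type": "collectd.alarm",
        "details": {
            "host": notification["host"],
            "plugin": notification["plugin"],                      # e.g. "intel_rdt"
            "plugin_instance": notification.get("plugin_instance", ""),
            "severity": notification["severity"],                  # "WARNING" / "OKAY"
            "message": notification["message"],
        },
    }


# Example: the kind of notification the threshold plugin raises when the
# media player VM's LLC occupancy drops below the configured minimum.
event = notification_to_vitrage_event({
    "host": "bare-metal-host",
    "plugin": "intel_rdt",
    "plugin_instance": "vm-media-player",
    "severity": "WARNING",
    "message": "intel_rdt: llc occupancy below configured WarningMin",
})
print(event["type"], event["details"]["severity"])
```

In the real integration this payload would be POSTed to Vitrage, which then maps it onto its entity graph to do the root cause analysis.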
So it's going to increase its use of the cache, and you'll see the video stream become interrupted. As soon as the noisy neighbour has been detected, the alarm will be triggered, and you can see it in the Vitrage entity graph and in the alarm display down the bottom here on the right. The alarm has been triggered over in the entity graph, and the collectd alarm and the Vitrage alarm now appear down the bottom. You can see that collectd has noticed that the threshold has been broken, so it describes what raised the alarm, and Vitrage can then specify exactly what caused the fault to occur. This part of the demo shows how the quality of the video decreases: you can see the cache levels of the noisy neighbour become high and those of the video player become very low.

Vitrage will then trigger an action that starts a Mistral workflow. This will clear the alarms, and it will also define a COS rule for the media player and the noisy neighbour: the media player gets a higher priority COS with more cache allocation, and the noisy neighbour gets a lower priority, so it won't get as much allocation. So even though the noisy neighbour will still be running, its cache usage won't affect the media player. You can see the alarms have been cleared, and in a few seconds the video stream will come back and be as clear as it was in the beginning. You can also see down on the bottom left that the last level cache values for the media player VM have increased again, and the noisy neighbour and the bare metal host aren't using as much. This is because we have defined the COS for both, and you can see up on the host where the COS is defined: it is two for the media player and lower for the noisy neighbour. So that was the demo.
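The kind of COS definitions the workflow applies can be expressed with the `pqos` utility from Intel's intel-cmt-cat package. The bitmasks and core lists below are illustrative assumptions, not the exact values used in the demo:

```
# Define LLC capacity bitmasks per class of service (COS):
# COS 1 (media player VM) gets most of the cache ways,
# COS 2 (noisy neighbour VM) gets only a few.
pqos -e "llc:1=0xff0"
pqos -e "llc:2=0x00f"

# Associate each VM's pinned cores with its COS
# (core lists are hypothetical, matching the earlier pinning example).
pqos -a "llc:1=0-3"
pqos -a "llc:2=4-7"
```

Because the bitmasks barely overlap, the noisy neighbour can thrash only its own few cache ways and can no longer evict the media player's working set.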
So just to summarise how that worked: the noisy neighbour is detected via the Intel RDT plugin reading the last level cache values from the platform, and collectd triggers a notification when the threshold set in the threshold plugin is broken. This notification is then propagated to Vitrage, where Vitrage determines the cause of the fault and does its root cause analysis. It then triggers the corrective action, which uses the cache allocation technology provided by Intel RDT to define a higher priority COS for the media player and a lower one for the noisy neighbour. Below is a link to the demo, which is online, so anyone can go and watch it again if they want to look into it further. I would also say this is a prerequisite to a presentation that is going to be given this evening, which I would encourage you all to go to. It will be given by Maryam from Intel and Afaf from Nokia. Also, if you want to learn more about Intel RDT technologies, the Intel booth has a demo of that technology running at these times, and if anyone wants to contact me or ask me questions, they are welcome to. That is my presentation.