Okay, thank you for coming. We're going to present "DPDK, collectd and Ceilometer" — the subtitle is "the missing link between my telco cloud and NFV infrastructure." We're presenting together: Maryam, Carlos, and me, Ryota Mibu, from NEC.

It's about the cloud. The cloud is very nice: you can get servers and run your services on top of them without any notion of the back-end technology or the availability of the physical servers. But you might notice someone eating your cloud: your resources may stop or be disrupted by someone else. Maybe it's a noisy neighbor, maybe it's yourself, maybe a physical failure of the servers. Then you might realize that your cloud is degraded, your service is damaged too, and that is a very bad thing for you and for your end users. So we need better capabilities to monitor and observe the cloud, to identify what is happening.

In most telco cases, as you may know, there are small and medium-sized deployments connected in one big place with a large-scale cloud, so you may have hundreds, or hundreds of thousands, of infrastructure systems, including servers and the services running on them. In the future, with IoT, you may have even more devices and clients, increasing subscriber demand and network traffic. In these situations you will still be asked to maintain the same level of service quality. Expectations from users stay the same, so you have to keep your service available whenever and wherever, however much the services or the network have grown. So the monitoring plane is very important for identifying malfunctions in your cloud and tracking overall system performance. We think it is very important to expand the amount of data available and to improve the alarming functions in our cloud. So we
need telemetry. Telemetry is a cornerstone for billing, benchmarking, intelligent orchestration and fault management, so this is a very important area for all of us. Let me pass to — sorry — Carlos for the use cases and more technical details.

As Ryota mentioned, everything is on the cloud now, and IoT is coming to the cloud: a lot of devices connected, a lot of broadband, everything. We need telemetry to be able to charge our customers properly, do benchmarking, and do intelligent orchestration and fault management. In this session we are going to focus a bit more on fault management as one use case, specific to telcos.

One use case in telcos is to deploy services in an active/standby manner, so we have one VNF — it will come later in the slides — in an OpenStack deployment: we have the controller nodes, many compute nodes, networking nodes and all that. What we want to do is monitor the network. In this case we can monitor the links of the network — physical resources and also virtual resources — using DPDK, and that's the example we are going to give here. We collect quite a lot of metrics as samples and push them to the controller, and based on that we can make intelligent decisions, gather performance metrics, and detect failures happening on the network side. So in this example we have the control node and two compute nodes.
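Before the architecture details, the active/standby decision just described can be sketched in miniature. This is illustrative pseudologic, not project code; the node names and the `choose_active` helper are assumptions for the example.

```python
# Toy sketch of the telco active/standby failover decision based on
# per-node link-status samples. All names here are illustrative.

def choose_active(link_up, active, standby):
    """Return which node should serve traffic given link status.

    link_up: dict mapping compute node name -> bool (link OK?)
    active, standby: node names hosting the active and standby VNF.
    """
    if link_up.get(active, False):
        return active            # active node healthy: no switch-over
    if link_up.get(standby, False):
        return standby           # active node failed: promote the standby
    raise RuntimeError("no healthy node available for the VNF")

# Example: the cable on compute1 is pulled -> traffic moves to compute2.
```

The point of the telemetry pipeline in the rest of the talk is to make the `link_up` input to a decision like this accurate and timely.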
We have Ceilometer, we have collectd — collectd is a monitoring system — and also OVS with DPDK. OVS with DPDK will be monitoring the links and will report to collectd; collectd will push all of these metrics to Ceilometer, and later, if there is something wrong, we can notify the users through Aodh. At first everything is deployed, ready and running, so the link status is okay — thumbs up on that. Perfect.

Again, in telcos we have many cases where we have a virtual network function, a VNF, deployed on one compute node and another one in standby mode on another compute node. At first everything is working. But suppose we detect something going wrong on the physical or virtual infrastructure — someone was careless enough to pull out a cable without taking any precautions first. It's important to detect this; it's a big problem. What should we do in this case in the telcos? We switch from the active VNF to the standby one. The problem is that it's not easy to detect this. With OVS with DPDK, with collectd, and now with the alarming functionality in OpenStack, it becomes a bit more convenient. We will show a couple of blueprints in different OpenStack projects where we were able to monitor, do some orchestration, alarm the user, and let the user quickly react to the event. So I will pass to Maryam to explain the DPDK and collectd parts.

Hi folks, my name is Maryam Tahhan.
I'm a network software engineer at Intel, and I'm also the project tech lead for Software Fastpath Service Quality Metrics in OPNFV. By the way, I'm only going to say that once, so it's going to be SFQM for the rest of the presentation.

For those of you who aren't aware of what DPDK is: it's the Data Plane Development Kit, an open source project that provides the utilities, libraries and drivers to enable fast packet processing in user space. And as Carlos mentioned, collectd is a system statistics collection daemon. Today I'm going to give you a quick overview of SFQM; we're going to look at some of the features we've been implementing in DPDK and collectd, and how you can use those features to pull statistics from DPDK and relay them all the way back to Ceilometer.

So let's dive straight into it. The ability to measure and enforce telco KPIs in the data plane will be mandatory for any telco-grade NFVI implementation. To meet this requirement, SFQM has been developing the utilities and libraries in DPDK for three things: firstly, the ability to measure telco traffic and performance KPIs such as packet delay, packet delay variation and packet loss; secondly, the ability to monitor the performance and status of your DPDK interfaces; and thirdly, the ability to detect and report violations that can be consumed by higher-level fault management systems. Today's talk focuses on the second one: monitoring the performance and status of the DPDK interfaces.

So what are the SFQM features available to let you do that? The features fall into two categories: DPDK features and collectd features. On the DPDK side, we implemented an extended statistics API.
This was a predefined API in DPDK; we simply implemented it for the drivers that were available — the 1-gig, 10-gig and 40-gig NIC drivers. What this extended stats API does is augment the generic statistics API. With the generic stats API today, you get aggregated stats across your lower-level registers. So if you're looking at your error counts through the generic stats API, you're not necessarily aware of which of the underlying error registers is actually causing your aggregate count to increment. This is where the extended stats API comes in: it lets you look at the hardware-level statistics registers on your NIC and expose that information all the way up to the DPDK application. Now you can see exactly what the cause of the errors is — is it undersized packets coming into your system? Is it CRC errors? — and you get a much more detailed view of what's going on.

On the collectd side we implemented two plugins. The first is a read plugin called dpdkstat, which pulls the statistics and link status from DPDK, using the extended NIC statistics API for the stats and the generic link status functionality from DPDK for the link status. The second plugin we developed was a Ceilometer plugin, which is capable of relaying any of the read statistics collectd pulls off your system to Ceilometer: it looks after the transformation into Ceilometer samples and posts them to Ceilometer.

So how do these actually work together? The idea is that you have some DPDK application running on your compute node, in the case of our example.
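Before the example topology: the generic-versus-extended stats distinction can be sketched like this. The register names below are made up for the illustration; they are not the real NIC register set or the DPDK API.

```python
# Illustration of aggregated (generic) vs per-register (extended) statistics.
# Register names are invented for the example.

xstats = {
    "rx_crc_errors": 0,
    "rx_undersized_packets": 0,
    "rx_oversized_packets": 0,
}

def generic_rx_errors(stats):
    """Generic stats API view: a single aggregated error counter."""
    return sum(stats.values())

def worst_offender(stats):
    """Extended stats API view: which underlying register is incrementing."""
    return max(stats, key=stats.get)

xstats["rx_crc_errors"] += 42    # a burst of CRC errors arrives on the NIC
```

With only the generic view you see "42 RX errors"; the extended view additionally tells you they are CRC errors.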
We have OVS with DPDK, it has a VM attached to it, and it's doing some sort of packet processing on that compute node. collectd then runs as a service on that compute node, and collectd is usually configured with a particular interval at which to pull statistics off the system. When the interval triggers, the collectd process initiates a read on all enabled read plugins — in this example dpdkstat is enabled — so it goes off and retrieves the statistics from OVS with DPDK, and it also retrieves the link status for any of the interfaces that are bound to DPDK. dpdkstat then dispatches those values to the main collectd process, which pushes them to all enabled write plugins — in the case of our example, the Ceilometer plugin. The Ceilometer plugin then transforms those values into Ceilometer samples and posts them to Ceilometer.

So let's dive into the details of the two plugins and how they work. I apologize about the coloring — I should have gone for something a little more visible — but nonetheless I'll walk through it.

Here's how the dpdkstat plugin works. As I mentioned, it's an input plugin, so it reads statistics off your system. But what we had to do was fork off a DPDK secondary process, and that's the DPDK helper process you can see in the diagram. Why do we need to fork off this process? Because we don't want to have to restart the daemon whenever your DPDK application — which is usually your primary process — dies. Now, the difference between a DPDK primary process and a secondary process is that a primary process can initialize the shared memory and the resources for DPDK, whereas the secondary process simply attaches to the primary process's shared memory and can't do any initialization.
There's no functionality within DPDK at the moment that lets you release all of a process's resources with a function call to the DPDK API, so for that reason we have to kill the DPDK helper process in the background when a primary process dies.

Allow me to step through it. When your DPDK application — your DPDK primary process — starts, it initializes a configuration file and holds a write lock on it. When your DPDK secondary process starts, it holds a read lock on that configuration file. So when your primary process dies, in order to be able to restart another primary process on your compute node, you need your secondary process to release the read lock it still holds on the configuration file. As I mentioned, there's no graceful way to release this from the DPDK side, so what we do is kill the process and let the kernel look after the cleanup, releasing that lock for us in the background. And voilà: a new DPDK primary process can start up on your system, and you don't have to restart the collectd service, because only the secondary helper process was killed and you can just spawn off a new one to go and retrieve the stats for you again.

So you might be wondering: how do we actually share the statistics between the DPDK helper process and dpdkstat? The answer is through a POSIX shared memory object. When dpdkstat is initialized, it creates a POSIX shared memory object and two semaphores. It looks for a DPDK primary process, and once one is detected to have started up — so you have some sort of DPDK application running on your platform — it forks off the DPDK helper process. The helper process mmaps that shared memory and is capable of writing to it, but it only writes to that shared memory when dpdkstat kicks a semaphore indicating that
"hey, you need to go read some stats for me." The DPDK helper process reads the statistics into shared memory, also retrieves the link status, and then kicks the second semaphore, indicating to the dpdkstat plugin that "okay, now I have the stats and the link status; you can go and dispatch these values to the main collectd process."

So now the main collectd process has retrieved statistics from DPDK, and we want to write these samples to Ceilometer. What we did was implement a Ceilometer plugin that takes advantage of two things: firstly, Ceilometer's RESTful API, which allows you to add custom meters to Ceilometer; and secondly, the Python bindings that are available to you within collectd. Now, when collectd starts and the Ceilometer plugin is enabled, it registers a write callback with the main collectd process. Once the main collectd process has retrieved all the relevant samples from the read plugins, it calls all of the registered callbacks — in this case the collectd Ceilometer plugin's write callback — and as part of that process the plugin transforms the samples into Ceilometer samples and does an HTTP POST directly to the Ceilometer API. So now we have passed link status and low-level statistics all the way from our physical NIC to Ceilometer. Shortly I'll hand you over to Carlos, who's going to talk about what happens once those stats are passed on.

There are some limitations to the plugins we've enabled today, so we have a bit of work to do on the link status side. What we would like to do is take advantage of the notification plugin architecture in collectd to simply post an event when link status goes down, rather than passing it as a generic statistic to Ceilometer and then adding an alarm separately on top of that.
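The helper-process handshake described a moment ago — a shared memory region plus two semaphores — can be modeled in miniature like this. It is a sketch using Python's `multiprocessing`, not the actual C plugin code, and the values written are pretend stats.

```python
# Toy model of the dpdkstat <-> DPDK helper handshake: one process requests
# stats via a semaphore, the other writes them to shared memory and signals
# back via a second semaphore.
import multiprocessing as mp

def helper(shm, req, done):
    """Plays the DPDK helper process: wait for a request, write stats."""
    req.acquire()          # wait until dpdkstat kicks the first semaphore
    shm[0] = 1             # pretend link status: UP
    shm[1] = 123           # pretend a counter read from the NIC registers
    done.release()         # kick the second semaphore: stats are ready

def read_stats():
    """Plays dpdkstat: request stats, block until the helper delivers them."""
    shm = mp.Array("l", 2)                       # stands in for POSIX shared memory
    req, done = mp.Semaphore(0), mp.Semaphore(0)
    p = mp.Process(target=helper, args=(shm, req, done))
    p.start()
    req.release()          # "hey, you need to go read some stats for me"
    done.acquire()         # wait for "okay, now I have the stats"
    p.join()
    return {"link_status": shm[0], "rx_packets": shm[1]}
```

The two semaphores make the exchange strictly request/response, which is why dpdkstat never reads a half-written snapshot of the stats.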
So we'd like to be able to post notifications directly to Aodh. The other thing we'd like to do is relevant to a feature we call DPDK Keep Alive. DPDK Keep Alive is a heartbeat mechanism for packet processing cores in DPDK: it protects against — sorry, it detects — stalls or application thread failures within your DPDK application. We'd like to be able to expose the core liveness (or core deadness, in this case) all the way up to Aodh, so again we'd like to take advantage of the notification plugin architecture in collectd to do that.

Some things we'd like to do in the future are performance, scalability and aggregation analysis for what we have. We'd like to do some Nagios integration as well, and in the future even look at developing a plugin for Open vSwitch with or without DPDK — so hopefully an agnostic Open vSwitch plugin capable of retrieving not just port stats but flow-level statistics. And that is the end of my presentation.

We are now going to talk about Doctor. It's an OPNFV project, and later on we will show a quick demo of how these two projects, SFQM and Doctor, are collaborating, how we are learning from and feeding back into each other. In the beginning of the presentation we said that telemetry is important: it can be used for benchmarking, billing and also fault management, and Doctor is addressing that last use case, fault management. It's a fault management and maintenance framework. We are working in OPNFV as part of the telco space, and the first of the main activities we do is identifying requirements: we work with operators and also vendors to understand the major requirements we have to address for fault management and maintenance. Then we do gap analysis, so we study OpenStack and its different projects.
We also do gap analysis on other projects outside of OpenStack. Based on these gap analyses we identify actions that need to be implemented — some work — and we implement it and push everything upstream: to OpenStack, to DPDK, to collectd. Everything is open source and upstream-first. Later on, as part of the OPNFV projects, we take all these projects with our code base, we integrate and we test, so that in the end we have an OPNFV release you can just download and install, and you have everything installed, integrated and tested; everything will work. Hopefully.

The four key requirements for Doctor: the first is consistent resource state awareness. This one is about having the exact state of our cloud, so that when some failure happens in our infrastructure, we are aware of those states. This is important because, so far, there are some components in OpenStack that do not detect certain failures, so they report their state as up when it is not, because something is causing a malfunction there.

The second one, very important, is immediate notification. As part of our gap analysis, we detected that Ceilometer was not able to quickly notify the user — and here the user can be the cloud operator or the cloud administrator; in ETSI terminology it can be the VNF manager, the NFV orchestrator, or anything sitting on top of OpenStack. We are talking about telco, but it can be used for anything — enterprise, whatever. With this immediate notification item, which we have also addressed, we could notify the user within one second, whereas before it was a couple of minutes; I will talk more about that later.

The third key requirement is extensible monitoring.
We understood from talking with different people that different people use different monitoring systems, and much of the time they have not just one single monitoring platform watching the virtual and physical infrastructure but two, three, four, five, ten, whatever. So it is important to have a platform, a framework, that can support multiple monitoring solutions, and to create a southbound interface so that we have a consistent and standardized API there.

The last one is fault correlation, which is also very important: when we have a lot of alarms coming in, a lot of triggering, we have to be able to detect and know the root cause of these failures. Sometimes one link fails on the network side and a lot more links fail because of that one specific link. So if we get thousands of alarms because of a network failure event, we want to narrow down to the exact problem: what was the root cause? So we do root cause analysis, and we also have a programmable, dynamic framework that can accept policies, so that if there is an event we can immediately enforce a policy.

These four key requirements map to the Doctor framework architecture on the right-hand side. We have the controller part, which is the normal OpenStack services we know — Nova, Neutron, Cinder, etc. We also have the notifier, which is responsible for notifying the user; in OpenStack that's Aodh, which is (or was) part of Ceilometer, the alarming functionality. And then the monitor and the inspector.

A use case is this one. We have an event in our NFV infrastructure — let's assume it's an OVS with DPDK interface that has some problem. We can detect it, and through the monitor we notify the inspector of this failure. Based on policies we can find the affected
resources, and based on that we can also update their state. If, because of a network failure, we can no longer access a compute node, for instance, then all the virtual machines running on that compute node have to be marked as in an error state. We do that in steps three and four. When this event happens, the user is notified that his or her virtual machines are now in an error state, and this is an immediate action, an immediate notification now.

Mapping these generic, platform-agnostic components to the OpenStack ecosystem: on the controller side we can be talking about Nova, Neutron and Cinder; on the notifier, Aodh. For the inspector we can have a couple of options — we are studying this — so we can have Congress, we can have Vitrage, we can have Monasca. On the monitor we can have a couple of them as well: Zabbix, collectd, Nagios — we have plenty of them available in the open source world, and also proprietary ones.

Our work in the Doctor project for the Liberty and Mitaka releases was mostly focused on the first two key requirements: consistent resource state awareness and immediate notification. Two important blueprints that we proposed, and that were accepted and merged in Liberty, were these two: the event alarm and the state correction.

For the event alarm: during our gap analysis, we noticed that it could take from seconds to minutes to be alerted of an event — something wrong, a threshold exceeded, a server down, things like this. Why? Because there were two polling mechanisms in place, and if they are out of sync it's even worse, so it can take a couple of minutes. What we proposed was an event-driven alarming functionality, where components like Nova can immediately send an event to the message bus, and Aodh is listening, taking this message and immediately alerting the user.
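To see why the event-driven alarm closes the gap, here is a toy contrast between polling latency and reacting to a message on the bus. This is illustrative only — it is not Aodh or Nova code, and the message text is invented.

```python
# Polling vs event-driven notification, in miniature.
import queue

def polling_latency(event_time, interval):
    """Latency with polling: the alert waits for the next poll tick.
    Polls happen at t = 0, interval, 2*interval, ..."""
    polls_done = int(event_time // interval)
    return (polls_done + 1) * interval - event_time

def event_driven_alert(bus):
    """Aodh-style listener: blocks on the bus and alerts the moment a
    notification (e.g. from Nova) is published."""
    return "ALARM: " + bus.get()

bus = queue.Queue()                  # stands in for the message bus
bus.put("compute-1 service down")    # the component publishes the event
```

A failure one second after a poll, with a 60-second polling interval, goes unreported for 59 seconds; the listener fires as soon as the event is published.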
We went from several seconds or minutes to within one second, and that was a really important requirement for the operators, because of the active/standby failover: we don't want our service to be down for a couple of seconds or minutes.

The second blueprint we implemented was in Nova: the state correction, or as you have most probably heard of it, the Nova mark-host-down API. If we have an external monitoring solution monitoring the compute host, we can probably detect failures a lot faster, with a lot more flexibility, and we can monitor more things than just the Nova compute service itself. So if we detect that something is wrong, we can now, through this new API, mark the compute node as down, and that will trigger an event to the message bus; Aodh can take it, and if the user has configured it, they can get the alarm.

From the project creation to Brahmaputra, released in March 2016, we have worked on Aodh and Nova, and we are also working on a couple more projects, namely Neutron and Cinder, and also Congress and Vitrage. We have also integrated Zabbix and DPDK in the OPNFV platform. The important part to mention here is that between March 2015 and September 2015 we worked upstream in OpenStack by proposing these blueprints and having them accepted and implemented, and we managed to do that. From September 2015 to March it was the integration phase, testing and all that, waiting for OPNFV to release the next version. These are the three blueprints implemented in Liberty. Important to mention here is that we are a project consisting of multiple companies, vendors and operators — we welcome everyone to join — and you can see that Intel and Nokia participated in this effort.

From the OPNFV Brahmaputra release and the Liberty and Mitaka cycles to the next one, we extended our focus of contributions: we are now also working on the monitoring parts and the inspector parts. So that's where
DPDK, collectd and so on come in on the monitoring part. We also had two other presentations on the inspector during this OpenStack Summit. So yes: Zabbix, collectd, Nagios — we can support all of these — and Congress, Vitrage, etc. The updated table of implemented and completed blueprints so far is this one; as you can see, again, multiple companies jumping in and helping each other.

Putting everything together, SFQM plus Doctor, we can get a very good view of our platform in terms of metrics and the current state of the network, and based on that we can notify the user in case of any events affecting their resources: virtual machines, storage, networking.

So we have a quick demo, just a video, where we will show a link status check — DPDK is checking the DPDK interfaces — and then the fault propagation: from DPDK to collectd to Ceilometer to the inspector component. We will update the Nova compute service, because we are saying: okay, this interface is down, so there is no connectivity, there is no need to keep this compute node in the scheduler — let's just mark it as down — and update the current state of all the virtual machines running on that compute node and alert the user. That's the resource state correction and alarm, and then later on we do the active/standby service switching.

So let's start. It's pretty small. In this window we are just preparing our platform: we are booting some virtual machines on compute node one and compute node two. On compute node one we have the active VNF running — it's just an HTTP server — and on compute node two we have the second VNF. Let me skip ahead; yes, it's just starting. On the top left side we are monitoring the current state of the interface link from Ceilometer; its value is one.
So it's good. On the next one we have the current state of the Nova compute nodes, so compute node one is up. On the third one we have two virtual machines: server one on compute node one and server two on compute node two, and they are both active. On the top right-hand side we have just a client refreshing a page being served by server one. So everything is okay. And in the other window, the application manager log — that's the consumer, the operator — you will get notified of events and then do the switch-over.

So there was the event coming in: basically, here DPDK detected a link down and reported it through collectd to Ceilometer. The link status is now zero, the current state of the compute node is now down, and the virtual machine is in an error state. Based on these events we notified the user, and it switched over to the standby node.

To summarize, I wanted to share with you an analogy about the work that we're trying to do. Trying to manage a complex cloud solution without proper telemetry, and without a proper telemetry infrastructure in place, is like trying to walk across a busy highway with blind eyes and deaf ears: you have little to no idea of where the issues can come from, and no chance of making any smart move without getting into trouble. What we're trying to do by combining DPDK, collectd, OpenStack and Doctor is essentially painting the pedestrian crossing — putting the building blocks in place to allow us to not cross the highway blinded. So I really encourage you folks to come and join us in OPNFV, in the Doctor project and SFQM, if you're interested in contributing. Thank you very much for your time.