All right, thanks for coming, everybody. I'm Dale source be with Itential. This is Bill Coward, Cox Communications. And I'm Michael Evan check with Itential. And we're going to talk about event correlation and lifecycle management today.

So, just to level set and get some definitions: what is NFV in this presentation? The definition from Wikipedia, which we thought was good enough, is a network architecture concept that uses IT virtualization technologies to virtualize entire classes of network functions into building blocks that connect and chain together to create a communication service. And we wanted to put this up here because we're making a differentiation between NFV as a service and a VNF, which is just an instance of a VM or microservice that might be running in your cloud, as opposed to the service, which would be a chain of these put together to provide a complete service from end to end.

So we put together a basic network service architecture. You may or may not have all of these when you deploy your cloud, but in an architecture you may have central, distributed, and possibly branch nodes, where you have compute at different places. You can see the green line, which represents a service going from the edge all the way in, and you may have VNFs as part of your service running in any or all of these locations, depending on what you have deployed. And around it you have all of the supporting applications, your orchestrators, your VNF managers, your VIM, that are providing and helping you provision your service. So there's a lot of overlap when you're putting together your architecture. We use ETSI as the framework for this, although you may be using ECOMP or whatever framework you're going to base your architecture on. But you can see there's a lot of overlap.
So as you've seen through the conference, there are many applications, or many different ways you can instrument your cloud, your management, your monitoring, and there's a lot of overlap, particularly, you can see, in the VNFM and NFVO space. Although ETSI or your framework might define that a particular function be handled by the NFVO, for example, there may be other applications in your architecture that can actually do those same functions. So if you're not careful, and if you don't define the swim lanes properly, you can end up with multiple applications in your architecture stepping on each other, essentially thinking, "Well, this went down, I need to manage this," or "I need to orchestrate that," and you get collisions.

So what do we need to monitor? We've got the business-as-usual type of information, like the servers and the network gear that provide your underlay. You have the cloud applications provisioning the NFV infrastructure. This will be your OpenStack services, your orchestrators, and those sorts of things. And you have the VNFs themselves, which are providing your services. You need to make sure that they're running and functioning properly across all the different locations that are providing your service.

What we're going to talk about later is based on monitoring, so you want to have your service assurance infrastructure in place to make sure that you're grabbing all the information and getting it up into your service assurance systems, which are the collection, correlation, and visualization. So here we've got represented the underlay routers and switches. You also need to monitor the servers which are hosting your applications, your VIM, your orchestrators, and where your VNFs are actually running. And of course, you need to be able to get telemetry and alarms out of your orchestrators.
So when you have VNFs spinning up, spinning down, moving around in your cloud, you want to know that these things are happening. And then finally, you've got to be able to pull all this together to understand that your service is running from an underlay and overlay perspective.

So what we've tried to do, in our experience, is not reinvent the wheel. We want to use the built-in intelligence of these systems. The NFV orchestrators, for example: we want them to manage the VNFs, we want them to manage the services, and we want to use the strengths of those existing systems to do that. Also, when you deploy your cloud, you're deploying it to support a service, something that you're selling or something that's going to provide value to your organization. So it's going to be going into a brownfield sort of scenario. It's got to work with existing systems and processes, and it's got to be supportable and operational within your organization.

One thing that we don't want to do is introduce a bunch of new portals or views to, say, network operators. They have their existing systems and processes. So we don't want to give them 20 different new tools that they have to look at, 20 different dashboards or graphs, to be able to manage the system, partly because they're not going to want to see that, and they're not going to understand what it all means and be able to put it together. So you've got to be able to collect all the information, put it into a system, and be able to correlate and make sense of it so that you can support the service.

So the challenges that you've got are integrating to the external systems and making sure, just as with traditional monitoring, that you've actually instrumented and gathered everything.
So you've got many options for applications that you might instrument and gather your events with. You need to make sure that those alerts and notifications are complete. One of the things that we've seen, for example: if you're integrating with an orchestrator, they may tell you to connect to a service bus, which will give you a notification, but it may be a notification with simply an ID number, which doesn't tell you anything about whether this is a fault or just information or regular telemetry. And you've got to actually query back into that system to understand even what it means, just to get a summary or severity or any kind of useful information. In a high-volume system, you can see where this can become a problem, where you can get behind very quickly, when you're getting tens or hundreds of events per second and having to query back into those systems one or multiple times just to figure out what the heck the event means.

And of course there's a new approach. You need to think of approaching and monitoring a cloud service differently than you would a physical service. Where you might care about a particular router or switch on your physical network, it may not matter as much in a cloud-native application, when you have VNFs spinning up and spinning down. You don't want to treat them as you would with a physical service, because you want to make sure the service is up, but not necessarily how many specific instances are running at any one time. You do want to know that those things are happening, where they're moving, and that your service is up. And to understand that, you have to have some correlation.

Thank you, sir.
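The ID-only notification problem described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any orchestrator's actual API: the `Event` fields, the `lookup` callback, and the function names are all hypothetical. It shows that a self-contained event needs no query back into the source system, while an ID-only event costs at least one extra lookup each.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Tuple


@dataclass
class Event:
    """Hypothetical notification as received from a service bus."""
    event_id: str
    severity: Optional[str] = None  # missing when the bus sends only an ID
    summary: Optional[str] = None


def enrich(event: Event, lookup: Callable[[str], Dict[str, str]]) -> Tuple[Event, int]:
    """Return a complete event plus the number of query-backs it cost."""
    if event.severity is not None and event.summary is not None:
        return event, 0  # already self-contained, nothing to fetch
    details = lookup(event.event_id)  # round trip to the source system
    return Event(event.event_id, details["severity"], details["summary"]), 1


def lookups_needed(events: Iterable[Event],
                   lookup: Callable[[str], Dict[str, str]]) -> int:
    """Total query-backs required to make a batch of events usable."""
    return sum(enrich(event, lookup)[1] for event in events)
```

At hundreds of events per second, the lookup count grows one-for-one with the ID-only events, which is the backlog risk the speaker describes.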
Thank you, sir.

Now that Dale has the entire network and all our resources monitored, sensored, and made available, we're correlating all the events that we're receiving from the network, the cloud systems, subsystems, and whatnot. Correlation is very important in the entire lifecycle management ecosystem. Correlation helps us identify relationships between events, and it requires a holistic look at the service, the network, the underlays, not only in the cloud but resources external to the cloud as well. It also requires multiple data types, multiple protocols, multiple interfaces from the devices that we're receiving the data from. And of course, as a large enterprise, service provider, or carrier, you're going to have a very, very large system; it could be thousands and thousands of resource elements.

Correlation is going to support us in identifying the relationships, and the relationships are usually topological or temporal, time-based, and also, of course, service-based. The data that we glean from the network turns into information, and with the information we can manage our systems and our network better. And of course this will turn into knowledge, and if we have knowledge, we're on the cusp of being a cloud whisperer. Isn't that right, Dale?
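The topological and temporal relationships mentioned above can be illustrated with a small Python sketch. It is an assumption-laden toy, not a description of any correlation product: the `Alarm` type, the `parent_of` topology map, and the 30-second default window are all hypothetical. An alarm is treated as a symptom when something upstream of it in the topology also alarmed within the time window; everything else is reported as a likely root cause.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alarm:
    resource: str     # name of the alarming element (hypothetical naming)
    timestamp: float  # seconds


def ancestors(resource: str, parent_of: Dict[str, str]) -> List[str]:
    """Walk upstream through the topology, stopping on a loop."""
    chain: List[str] = []
    current = parent_of.get(resource)
    while current is not None and current not in chain:
        chain.append(current)
        current = parent_of.get(current)
    return chain


def correlate(alarms: List[Alarm], parent_of: Dict[str, str],
              window: float = 30.0) -> Tuple[List[Alarm], List[Alarm]]:
    """Split alarms into (root causes, suppressed symptoms)."""
    roots: List[Alarm] = []
    symptoms: List[Alarm] = []
    for alarm in alarms:
        upstream = set(ancestors(alarm.resource, parent_of))
        # Topological test (other is upstream) AND temporal test (within window).
        explained = any(
            other.resource in upstream
            and abs(other.timestamp - alarm.timestamp) <= window
            for other in alarms
        )
        (symptoms if explained else roots).append(alarm)
    return roots, symptoms
```

With an aggregation router as the parent of two edge routers, an outage on the aggregation router plus near-simultaneous alarms below it collapses to a single root cause, while a later, unrelated edge alarm is still surfaced on its own.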
But the challenges that we have right now start with the lack of standards. The community and the industry are working on these challenges, but as Dale just mentioned, sometimes you get an event from OpenStack and it just has a user ID or ID. That doesn't mean anything to our brownfield systems. And, like Dale mentioned, having to query for more information about this particular event, that's just not scalable. And of course there are possible scaling issues with Ceilometer and RabbitMQ, the queue sizes, and if you have a distributed architecture like we have, it can make it even more challenging. And of course, as I mentioned before, integration: it's not all going to be SNMP. There are a lot of protocols and data models out there for event monitoring.

We have this concept, I don't know, we kind of coined this term: healing collisions, where we have an instance of multiple systems, multiple devices, trying to heal a VM or a VNF at the same time. There's going to have to be a concept of a cross-domain orchestrator that has the awareness and can communicate the faults and possibly direct a specific entity to do the healing, and not multiple entities trying to heal a single event or fault.

The tools out there right now are pretty sparse. There are a few out there, and you can download them and play with them and what have you, but we're really looking for service provider, carrier-grade solutions at this time, and we have work to do.

So, event correlation: the opportunity. Of course, closed-loop monitoring, as opposed to open-loop monitoring, where the event is just created into a ticket and an actual human has to do something about it. That's great. I don't know if you guys ever worked in a NOC, but there are instances where things just go all catawampus: alarms scrolling, everything's turning red. It would be great if you could suppress some of those alarms, reduce the duplicates, and things like that, and a
good correlation engine is going to help us with those capabilities. And if an aggregation router goes down, as opposed to an edge router, the ripple effect in the network, the footprint of that effect, is very different, and if you have a correlation engine, it can help us with that. And of course the big picture is that correlation increases our knowledge about the network and helps us with our strategic trajectory, the decisions we make and implement or don't implement in the network. And of course we're all in it for business, so the bottom line: if we save a penny, we earn a penny. If you're not interested in reducing OPEX, I'm sure your boss is, so we all have to be cognizant of that. Correlation helps us reduce incidents, potentially reduce incident time, and it just enables our human resources to work smarter and not harder. So that's the opportunity.

So what do we have in the future? ONAP is working on this: within their architecture framework, there's a data collection, analytics, and events engine. It kind of harkens back to, I don't want to say the old days, but it collects information and modifies the events into a standard base model, and then makes those available for all your analytics and all of your database lakes and whatnot, so you can share that information. And OPNFV, not Open NFV, is working on a standard data model, which is a client-server, agent kind of thing, but if you get an alarm from a Juniper VNF, it's going to look and feel similar to one from a Cisco VNF with the VES data model that they're working on.

There are a few open-source tools. Vitrage is a correlation engine; it is an OpenStack project. Monasca is an OpenStack project. And Zabbix is open source, but it's not an OpenStack project. But those are great,
I guess, germs, if you will, and they will eventually, hopefully, meet most of our needs in a service provider environment. There's also talk about machine learning, which some vendors are doing, and that's great, and those capabilities will only increase. If we can take an AI or machine learning solution and implement it into our challenges of lifecycle management, we'll take it. We'll do it. It's going to be all the better for the community. But the bottom line is there's a need for a comprehensive framework, whether that comes out of ONAP or OPNFV, and how do they reconcile with each other, or do they? It's all part of our lifecycle management, and Mike's going to talk about some of these challenges in detail, some of which we've actually seen in the field.

Yeah, okay. So, lifecycle management in this environment is kind of funny, because everybody wants to get in on it, right? Automation is a great thing, but everybody wants to do it. So we have a VIM that, if Heat is used to spin up a VM, will do lifecycle management on that VM. You have an SDN controller, which, if a service instance is brought up inside the SDN controller, will do lifecycle management. You have a VNFM that thinks it owns all the VNFs. And then you have an NFVO that owns the service. All of these things will do lifecycle management. And then you have a cross-domain orchestrator that can get into the picture, and it can bypass those things and potentially do lifecycle management as well.

So this results in a lot of confusion and overlap in functionality. You saw the ETSI model in the first part of the presentation, but if you have all these things and they're not playing in their little box, you have issues, and we've seen some of those issues. And there's generally no correlation in the cloud in any of these systems.
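One way to picture the overlap described above is a small arbiter that lets only one designated system heal a given resource at a time. The Python below is a hypothetical sketch, not how any VIM, VNFM, NFVO, or cross-domain orchestrator actually behaves: the `HealingArbiter` class, its method names, and the ownership map are all assumptions made for illustration.

```python
from typing import Dict


class HealingArbiter:
    """Toy coordinator that grants at most one healer per resource."""

    def __init__(self, owner_of: Dict[str, str]) -> None:
        # Maps a resource type to the one system allowed to heal it (assumed policy).
        self._owner_of = dict(owner_of)
        # Maps a resource ID to the healer currently working on it.
        self._in_progress: Dict[str, str] = {}

    def request_heal(self, healer: str, resource_id: str, resource_type: str) -> bool:
        """Return True only if this healer owns the type and nobody is already healing."""
        if self._owner_of.get(resource_type) != healer:
            return False  # outside this system's swim lane
        if resource_id in self._in_progress:
            return False  # a heal is already under way
        self._in_progress[resource_id] = healer
        return True

    def heal_finished(self, healer: str, resource_id: str) -> None:
        """Release the claim so a later fault can be healed again."""
        if self._in_progress.get(resource_id) == healer:
            del self._in_progress[resource_id]
```

For example, with `HealingArbiter({"vnf": "vnfm"})`, a request from `"heat"` to heal `"vnf-1"` is refused, the first request from `"vnfm"` is granted, and a duplicate request is refused until `heal_finished` is called.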
Each of them just says, "Oh, that's mine. It's broke. I'm going to fix it."

So I want to run through some examples, but one thing I'll say before I run through an example of a trouble in the environment: one of the cases where this became really apparent early on was when we were doing some stuff and a vendor told us, "Well, in order for us to show you healing, you can't take the VM down. You have to suspend it, because if you take the VM down, the layers underneath, like the VIM with Heat and all, are going to fix it. And so we can't even demo what we want to show you in the way of healing, because it'll be healed by something else."

Okay, so our first trouble scenario that we're going to run through: the VNF goes down. That information goes to OpenStack. That information goes to the SDN controller. That information goes to the VNFM. That information goes to the NFVO. It goes to the cross-domain orchestrator. So they all know this thing has happened. They all know the VM is down. But who fixes it? And the answer, from what we've seen, is they all might try to fix it. We've seen scenarios where Heat will actually try to fix it at the same time the NFVO thinks the service is down, so it tries to fix it too. And this gets into a very bad state, because Heat tries to spin up another VM at the same time the NFVO is trying to delete the stack. It can't delete the stack because there are ports occupied, so it can't even delete the network that makes up the stack, and so the NFVO eventually gets out of sync. Now, the service could have been impacted, right? The service could have been up. Well, it wouldn't have been in this case, because the VM is down. But you have a customer now that is not up.
Okay, if Heat got the VM up, then it might be up, but it may not be configured properly. So you don't even know what state you're in. And if you were able to recover and get into a good state, somehow the VNFM and the NFVO still think it's down. So if something happens again, they won't even try to heal it, because they're not in a good state.

Our next scenario that we wanted to cover was: what about management networks? A lot of these things today are being managed over management networks, and whether they're up or not is being determined over the management network. Well, if the management network goes down, your VNFM and your NFVO and your cross-domain orchestrator may all detect that this device is down. Once again, they're not correlating, right? So they don't know that it's because the management network is down and they can't get to the device because of that. So they're going to try to heal it. Well, in this scenario the customer wasn't even down. The customer was up and running happily, because their traffic doesn't flow over the management network. And now, when the VNFM or the NFVO takes that down, they've basically caused an outage. So they're trying to help, but you have this scenario.

And then in our third scenario we talked about something that's completely out of the domain of OpenStack. So something in a core network or a managed access network goes down. Well, if you're monitoring for traffic flow, once again, you think your service is down. And because they don't look outside of their domain, they think it's something inside their domain. So they may try to heal that VNF when the VNF is up and fine. Okay, now you'd say the service is down.
That's right. But when the network orchestrator has fixed that, the service, maybe 20 seconds, 30 seconds, the customer could be up and running. But in this environment, with the VNFs running on VMs, a lot of the time these VNFs take five minutes just to boot and get their interfaces back up. So instead of it being a 30-second outage, it's potentially a five-minute outage. And we've seen these scenarios occur.

All right, so the question then becomes: what do you do? One possible solution, and it doesn't have to be the solution, you can do it without a cross-domain orchestrator, but if the cross-domain orchestrators function the way they should, they will work across the domains. They will have the view of the end-to-end service. But they can't be the ones that fix the problems. They just need to direct the domain orchestrators to fix the problems. And in order to do that, they're going to have to be tightly integrated with all these domain orchestrators, and they're going to have to be tightly integrated with correlation. And the domain orchestrators have to allow the cross-domain orchestrator to tell them what to do instead of just acting on their own.

Okay, lifecycle management, as far as existing and working with correlation: some of the things that we've learned. We've learned, from a collection standpoint, that we don't always get all the information. So that's one thing that has to happen. We know from experience that operators are used to using the tools that operators like to use. So you can't just say, "Oh, we've got this nice new GUI that does monitoring in the cloud." Operations isn't going to want to swivel-chair to 50 different GUIs in order to determine the information they want. You have to integrate the monitoring with other monitoring systems so that you can provide one view to your NOC. I've already talked a lot about tight integration. You can't send an alert with an ID in it. An alert with an
ID means nothing to the other systems. And if they then have to query multiple times per alert, when you put it into a network where there could be thousands and hundreds of thousands of alerts in a short period of time, that's not scalable. So the alerts that come out of OpenStack, out of the NFVO, out of the environment as a whole, have to contain the information that's needed for successful integration. A correlation engine has to be involved to be able to determine: What does this mean? Who was impacted? How does it get fixed? And it has to tie into a cross-domain orchestrator, or integrate into all the orchestrators itself, to make sure that the right actions are taken.

And then, correlation could be done in the old world; it could be done in a new world. But anytime you introduce new systems, you have to figure out how to phase them in to replace the existing ones. And I already covered this a little bit with the GUIs: you can't introduce 50 things to replace one. You have to consolidate so that NOCs and people who work with systems have single points of view. And then the big thing is: stay in your swim lane, but understand there's a pool. An NFVO has a specific function, but it shouldn't act as if the only thing that exists in the world is its lane. It has to know there are other things in that pool with it, and work with those systems.

Okay, we can open up for questions now. There are two mics. If you have a question, please go over to one of the mics. No questions? Then everything was crystal clear. Appreciate you guys for coming, and hope this is of value.