Like I said, this is about distributed network function virtualization, so something in the NFV telco space. My name is Rimma Iontel; I'm a senior architect at Red Hat.

I'm Fred Oliveira from Verizon.

Hey, I'm Sharad Kumar. I'm from Big Switch Networks; I'm a software engineer over there.

All right. So we're going to talk about what distributed NFV is, why we are looking at it, why it's of special interest to telcos, and then we're going to talk about some of the lab work that Big Switch has done to implement distributed NFV and verify its feasibility for deployment. So first of all, what is distributed NFV? What do we mean by distributed? The main thing here is that it's distributed geographically. You keep your control plane, the majority of your control functionality, centered in your core data center; that's where you have your OpenStack controllers and a lot of other things. Then, across the WAN in remote locations, you place only your computes, so in OpenStack your Nova computes go at the remote site. That minimizes the footprint you have at the remote site and centers all the brain power of your network in one core data center, which makes it easier to manage. It's applicable in many different cases, for telcos and for enterprises: residential applications, putting applications in remote points of presence, mobile edge computing, etc. I'm going to go into a little more detail on actual use cases. So which components go where? You have your core data center, and like I said, that's where you put the majority of the brains of the operation.
So you have your OpenStack controllers and probably the bulk of your storage (some of the storage will also go at the remote site). You have your network controllers, so you control the network both in the data center and across the WAN to the remote site; most of the network control is centered as well. You have your cloud controllers, your orchestration, your deployment tools, so you deploy applications at the remote site from your central location, along with your monitoring, troubleshooting, and analytics tools. And some of the applications, the ones that you don't want to place at the edge, or that don't need to be close to the end consumer, you can place at the central site as well. Basically, that makes it much easier to have a uniform, homogeneous environment in your core data center that you know can be managed uniformly and easily controlled. Only the remote sites, where you place your compute nodes with the edge applications, need to be distributed, and those are the ones that might require actual truck rolls, in telco speak: sending somebody out in a truck to take care of the equipment. The application itself, though, is managed and controlled from the central site, so you minimize your hardware footprint at the edges.

As for specific use cases, I'll speak generically, because Fred is going to address the actual Verizon use cases. Where we see applicability of distributed NFV is virtual CPE deployments, and specifically thick CPE: a type of virtual CPE where you actually run virtualized components on the remote CPE device, which in this case would be an x86-type server. You can have enterprise or residential versions. The enterprise version can be a bit more involved: you can have various VNFs running on that CPE, like a firewall, a load balancer, WAN optimization, security applications like deep packet inspection. You can terminate your VPN, your IPsec VPN, on the device, do your network address translation, etc. You can also have some storage on the device, and you can run a web server or your VoIP telephony application on it as well. The bulk of it, though, is going to be in the central data center, where you have your access control and your policies; you exercise control from there and just configure the edge device for the stuff that you want close to the customer. For residential, it's a much simpler CPE, but the concept is very similar: fewer CPU cores, maybe less RAM, a much simpler and cheaper device. But it would still run the things you normally run at home on your current router: NAT, firewall, parental control applications, etc. A remote PoP is also where you would put your video cache for CDNs, for instance, so you don't have to backhaul all your video across your whole network; you want to put it as close to the consumer as possible. Web caches as well. Another application is mobile edge computing, which is a big topic right now: telcos giving developers access to place their applications very close to the mobile user, for instance for things like augmented reality or virtual reality, things that require very low latency and high bandwidth, where you also don't want to backhaul all the way to the center of your network. That's another place where you see the applicability of distributed NFV, and when you deploy something at the mobile edge, you want the footprint to be as small as possible; you don't want to waste your space on controllers, etc. Now Fred is going to talk specifically about the Verizon use case and why they, as a tier-one telco, are interested in this type of architecture.

Hello. So at Verizon we have a need for pretty much all the use cases that Rimma talked about, but
basically, we're trying to roll out new services over time, of all the different flavors, moving them from a centralized environment and distributing them potentially all the way out to a residential home. The CPE use case is certainly one of the largest ones that we see, but again, some of the new applications that are coming up, virtual reality and augmented reality, are areas that require much lower latency and processing much closer to the consumer to get that kind of functionality. Bandwidth tends to be one of our highest cost functions, and what we want to do is limit the amount of bandwidth and do as much caching as we can, as close to consumers as we can possibly get. The view that we have is that there are going to be hundreds, potentially tens of thousands, of these remote sites out there. How do we manage all of these things? How do we distribute functions to them? How do we deploy applications across this, whether within a single unit or as a more distributed VNF where different components might actually be running at the central site versus at the distributed edge? Again, the goal for all of these is to improve the customer experience and provide newer, on-demand services: as a consumer needs a caching function, or another firewall, or parental control, all these VNFs could be downloaded dynamically as needed, and we would deploy them at that time. And from a telco perspective, reliability and availability are among our key concerns. By putting the compute nodes at the edges and keeping the control nodes at a more centralized location, we can provide a higher level of redundancy and reliability from the control-plane perspective, whereas the impact of a local site failure only affects that local site, not the rest of the environment, while at the same time we have a relatively low-cost solution.

So this is kind of the standard scenario. We have all the cases: wireless scenarios, video caching, deployment of internet services; the SMB environment, where you might have a Dunkin' Donuts or something sitting out at the edge that wants to have a local processing function, and this might be just a firewall or load-balancing function that's local; and then a larger enterprise that might actually have deployed multiple nodes, where you can deploy actually scalable functions at that location. What we'd like to do, again, is distribute the VNFs across this whole hierarchy: some of the VNFs might run at a more centralized site, and the VNFs that are specifically latency-sensitive or bandwidth-sensitive would run as close to the consumer as possible. One thing we can say about the historical context is that the cost of compute and the cost of storage are decreasing at a much faster rate than the cost of our transport, and so this pushes us toward doing as much edge computing as possible and limiting the amount of bandwidth that we have to backhaul to a central site. We see this as a continual trend: computing power will move closer and closer to the edge, and by doing that, the number of compute sites will grow, again potentially into the thousands. So our goal is dynamic control: deploy services to the right locations, at the scale they need to be deployed, leverage the infrastructure at the appropriate places, and if a failure happens at a certain location,
we can run the processing somewhere else, or potentially locally. Then there are dynamic service graphs, and this gets interesting as we get farther along: we want to be able to deploy these applications, some of them at the same site, some of them potentially at a more centralized site, be able to insert VNFs into a graph transparently and enable the transit through this path, and deploy the application wherever it makes sense to deploy it. And then highly available service management.

So, we just talked about some of the environment. We worked with Red Hat and Big Switch and said: let's try and figure out what makes sense to validate some of these issues. We're concerned about how the control protocols of OpenStack work, how the management of the network will work, and whether that's feasible within the constraints we have. And I will point out that some of the functions we have there raise latency concerns, and there are bandwidth-heavy functions, things like logging, that do want to come back to the central site, so these will limit the functionality somewhat. So these are some of the challenges that we see in our current environment. How do you extend an L2 segment
without necessarily having different speeds in different segments of your network? There might be a local top-of-rack switch for that kind of function that has very good local performance, but the connection of that local switch up to a centralized site, so you can actually understand the whole topology, may have gaps. And if you want to extend a VNF service graph from one location to another, you'd have to extend the L2 functionality potentially across this environment. The OpenStack control plane is one of the big concerns we have: monitoring health indications, making sure that all those connections stay available, and that the distributed nature of the functions doesn't destroy the monitoring capabilities or the visibility of the environment. Then there's service resilience: we want to be able to keep running in the face of failure modes, failures at the edges, failures at control nodes, and be resilient in all these places. And then network configuration: how do we maintain it, how do we troubleshoot it? These are all pieces where we need a single view of the network, a single view of the environment, that we can manage from a centralized site, with visibility into what's going on in the network for all of our operational teams. You want to talk about the setup?

Yeah, so at Big Switch we tested the following topology in our lab setup. (I'm not used to looking back at the slides, but yeah.) We had a core data center wherein we had the Big Cloud Fabric controller cluster; this is basically a pair of BCF controllers in active/standby mode, to give high availability and resiliency. We had spine switches, which were connected to top-of-rack leaf switches, and then we had a rack of bare-metal servers.
We used Red Hat director as our undercloud to provision the rest of our OpenStack cluster, which includes the OpenStack controllers as well as the compute nodes. On the compute nodes we had Switch Light VX running, which is a virtual switch from Big Switch, also programmed by our controller. Then we had the remote site, wherein we had top-of-rack leaf switches; these leaf switches were directly connected to the spine switches in the core data center. And we had compute nodes; these compute nodes are also orchestrated by the same director sitting in the core data center. Then we had a latency generator appliance. This was a homegrown, DPDK-based application which just adds latency to every packet that passes through it. We ensured that every single wire from the core data center to the remote site goes through this latency generator, to simulate the WAN. So this is the physical topology of the lab itself. We had the core data center with the BCF controller cluster, and the spine switches and the leaf switches, both in the core data center and at the remote site. Then we had the rack of resources running our OpenStack cluster. We wired them up, and this forms the data path for our fabric. Our controller uses an out-of-band management network to communicate with all the physical switches (that is shown by the dotted lines here), and to talk to the virtual switches the controller uses an in-band connection (highlighted by the orange lines here). Then we added the latency generator right in between, and we ensured that every single wire from the core data center to the remote site passes through it, hence adding latency to every packet. The objective of the whole test was to validate that Big Cloud Fabric remains resilient even in the presence of WAN latencies. Verizon's metro optical WAN has a maximum latency of 40 milliseconds from the East Coast to the West Coast, and that became our magic number here, so the
tests used latencies between 0 and 40 milliseconds one way, which is an RTT of up to 80 milliseconds. The primary focus of all the testing was to see how latency affects the control plane. Specifically, we wanted to see its effect on the out-of-band management network that the controller uses to program the physical switches, on the in-band network that the controller uses to program the virtual switches, and also on the OpenStack control-plane communications, where the agents residing on the compute nodes talk to the controllers in the data center. These are the tests we performed. We used a simple ping application to ping from a virtual machine running in the core data center to a virtual machine running at the remote site, and the success criterion for all the test suites we ran was that the fabric shouldn't lose even a single packet. We started off with the controller failure scenarios. To begin with, we did the failover, where we force the active controller to fail over so that the standby becomes the new active. We also tested headless mode, wherein we bring down both controllers at the same time; the fabric still continues to forward packets, because it doesn't need the controllers for that, but any changes in the network will not be reflected in the fabric, because the controllers are absent altogether. In this test suite we didn't see any packet loss at all. The second set of tests we ran were switch disconnects and reconnects. Every time a leaf switch disconnects from the management network, the controller has to know that it needs to take that leaf out of the fabric. So what the controller does is that it
reprograms the subset of switches affected by the removal of that switch, to ensure that the fabric is still fully connected. The next set of tests were interface up and down. Here, we either do a shut/no-shut on a specific interface of a given leaf using the controller itself, or we go to the switch and just yank out the wire. In this case too, the controller reprograms a subset of switches to ensure that the fabric is fully connected and there are no holes. The last tests we ran were switch reboots, wherein we would just physically reboot the box itself; this is an extreme version of the previous tests. Here again, the fabric has to remain resilient and ensure that everything stays fully connected. All in all, what we saw was that during all the tests we ran, we didn't see a single packet drop in the fabric, and when it comes to the OpenStack agents, we didn't see any of them timing out either. I'll hand it off to Fred to conclude.

So again, I think our initial concern was just: can this environment actually work? And there are several pieces here where we have shown that it does. One of the other issues is: given the various latency issues, network availability, and the various connectivity and failure modes we might encounter, can we actually survive this and run this environment long-term? I think we've shown, at least, that it's feasible. We certainly have more work to do to demonstrate scale in this kind of environment, but as an initial use-case validation, we've shown that it does work and that we can run this. And again, I have to thank Red Hat, whose environment has worked well, and Big Switch, which has provided all the network functionality, and that has worked as well. And I think that's our last slide.
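The pass/fail criterion throughout these tests was zero packet loss on a long-running ping between VMs. The actual harness Big Switch used isn't shown in the talk; as a minimal sketch, the check can be automated by parsing the summary line of standard Linux `ping` output (the addresses below are illustrative):

```python
import re

def ping_loss_percent(ping_summary: str) -> float:
    """Return the packet-loss percentage reported in ping's summary line."""
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_summary)
    if m is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(m.group(1))

# Example summary as printed by Linux ping (VM address is illustrative).
sample = (
    "--- 10.0.1.5 ping statistics ---\n"
    "1000 packets transmitted, 1000 received, 0% packet loss, time 999000ms\n"
)
# Success criterion from the talk: the fabric must not lose a single packet.
assert ping_loss_percent(sample) == 0.0
```

In the lab scenario, the same check would run continuously while each fault (controller failover, switch disconnect, reboot) is injected.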
Yes, any questions? We've got one.

Q: A question for Big Switch. You mentioned that the Big Cloud Fabric in the core data center is extended out to a remote site, to the leaf at the remote site. How many of those remote-site leaves could you actually support as part of the Big Cloud Fabric in the core data center?

A: So today we have a Clos architecture in which we have six spines and support up to 16 racks. So, in effect, a single fabric can support up to 16 racks, regardless of the distribution. The way we tested it was just a single rack in the data center and a rack at the remote site, but we don't see that as a limiting factor.

Q: Okay, let me rephrase the question. If you already have 16 leaf switches in the core data center, how many remote leaf switches can I have, or am I already maxed out?

A: So 16 leaf switches would come to eight racks, because on each rack you would have two top-of-rack switches. In that case you can have eight racks in the data center and eight at your remote site, or eight remote sites with a single rack each.

Q: Okay, but 16 total. Okay, thank you.

Q: I'm interested in your choice of a simple ping test between VMs to verify this. On one level, that just seems to establish that basic network connectivity continues, rather than actually testing that the OpenStack control plane itself continues to function: that you can create networks under those latency conditions, that you can spin up VMs. Did you look at that?

A: That's our next phase. Essentially, we want to run an actual application: not just, you know, spin up a VM and set up your networking, but deploy an application, a fully functioning application, something like a web stream or a video stream. It's just that we ran out of time.
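The fabric-sizing arithmetic from the first question above can be sketched as follows. The 16-rack single-fabric limit and the two top-of-rack switches per rack are taken from the answer; the function name and everything else are illustrative:

```python
MAX_RACKS = 16       # single-fabric limit quoted in the answer
LEAVES_PER_RACK = 2  # redundant pair of top-of-rack (leaf) switches per rack

def remote_racks_available(core_leaves: int) -> int:
    """Racks left for remote sites after the core data center takes its share."""
    core_racks = core_leaves // LEAVES_PER_RACK
    return MAX_RACKS - core_racks

# 16 leaves in the core = 8 racks, leaving 8 racks,
# e.g. eight single-rack remote sites.
print(remote_racks_available(16))
```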
No, no, I want to add: I think one of the other interesting issues is how you do a real service graph where you're potentially crossing this boundary that has very high latency and a potentially very narrow pipe. How do you distribute the VNFs across that path, and how do you have the knowledge of where to distribute them? So again, we have lots more work to do in the future.

Q: Yeah, because I remember, I think it was BT, about a year or so ago, that did an analysis of more or less precisely this kind of architecture and concluded that there were quite a lot of issues with using OpenStack to do this, really on the control-plane side of things, like the scalability of Nova and the sensitivity of Nova to latency, and stuff like that.

A: Yeah, so that's one of the main concerns, and one of the reasons why we are looking at this: the issues with the message bus that might potentially pop up, your database access, etc., your ability to communicate with your Nova computes, basically your control plane. How can it function? The data plane is going to be remote and distributed, so you expect that to work, but that type of control-plane communication is a problem. Nova Cells are not ready for consumption in this type of environment; potentially they can solve the issues with distributing your compute nodes to the edges, but if you're not using Cells, what is the feasibility of that type of architecture?
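For experimenting with how control-plane traffic (message bus, database, agent heartbeats) behaves under WAN delay, the lab's homegrown DPDK latency appliance can be approximated in software with Linux `tc`/netem. A minimal sketch, assuming a Linux host in the path; the interface name and delay values are illustrative, and the generated commands require root to run:

```python
def netem_commands(iface: str, delay_ms: int, jitter_ms: int = 0) -> list[str]:
    """Build the tc commands that add (and later remove) a fixed egress
    delay on `iface`. Applied to both directions of a link, a 40 ms
    one-way delay reproduces the 80 ms RTT upper bound used in the tests."""
    add = f"tc qdisc add dev {iface} root netem delay {delay_ms}ms"
    if jitter_ms:
        add += f" {jitter_ms}ms"
    delete = f"tc qdisc del dev {iface} root netem"
    return [add, delete]

add_cmd, del_cmd = netem_commands("eth1", 40)
print(add_cmd)  # run via a shell (as root) to impose the delay
```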
So, yeah, we definitely realize that we need to do a lot more work. Plus, as you saw, this was only tested with one remote site. Potentially we want to be able to do ten thousand and more remote sites, because the idea, especially for things like virtual CPE, is to have, say, one or two data centers that manage a whole huge region, like the United States, or EMEA, or large countries and large regions. You don't want to be able to run only a few of those; you want to have potentially tens of thousands of remote sites.

Q: Somewhat related to the previous question: I saw a recent posting by jaypipes on the openstack-dev list where he expressed some concerns about vCPE in OpenStack-based clouds; maybe you saw it. I'd advise you to look at it in that regard. It's about resource requirements, which is what you'd kind of figure out when it comes to nova-compute. Have you come up with minimums, as in deciding the minimum number of CPUs and amount of RAM that you will have on your compute hosts? Or maybe you haven't figured it out yet.

A: Well, we have minimums for what we want to deploy, and our virtual CPE, or thin CPE, environment is on the order of one Atom CPU, so that gets to be extremely small, and the limitations on that will be interesting. The overhead of running even the compute services might overrun the capacity of the RAM in that kind of environment. But again, I think those are limits that we need to explore some more.

A: Yeah, we've done internally some of the work that you're talking about: analyzing how many CPUs are consumed by running nova-compute, and then, with the addition of each VM, what gets consumed, especially since some of the applications you might want to run, things like DPDK, are going to consume even more of your CPU resources. Unfortunately, I can't really share the data; it's internal stuff. At some point
we might publish it; it's a different group, so it's their decision, essentially. But yes, we are doing that sort of analysis. We have that in mind; we realize that it might be a problem. We've had discussions about how to minimize that type of footprint, and about the options for deploying remote virtual CPE without completely getting rid of OpenStack, because we think OpenStack adds a lot of value in terms of the type of orchestration you can do, and in terms of having a homogeneous environment to manage: one way of deploying your VMs, whether internally in your core data center or remotely in your geographically distributed sites. So we still want to keep that type of value without using all your resources just to run nova-compute.

Thank you, everybody, for coming. If you have any other questions afterwards, feel free to come talk to us, or reach out to me personally (I don't know about the other guys). You can very easily find me: I'm the only Rimma Iontel in the whole world, so if you Google me, that's going to be me. Thank you.