Hello, everybody. Good afternoon and welcome to this session. Our session is going to be a panel discussion on achieving five nines of VNF reliability in a telco-grade OpenStack cloud. As you most probably know, VNFs, or virtualized network functions, require five nines of reliability to deliver the SLAs required in a telco network. And you most probably also know that achieving five nines of VNF reliability on an OpenStack-based cloud platform is challenging, to say the least. However, look no further. I have here esteemed panelists representing four companies who will discuss those challenges and propose some solutions. So without further ado, let us get started. On my left is Kandan Kathirval. He is Lead Principal Technical Architect from AT&T. Sitting next to him is Wayne Walsh. He is SDN and NFV Solution Architect from Intel. Next to Wayne is Rima Yontel. She is Senior Solutions Architect from Red Hat. And farthest from me is Fausto Marci. Did I say it right, Fausto? He is Principal Cloud Consultant from Ericsson. We're from the same company, so I get some leverage. And I'm Hasib Akhtar from Ericsson as well. Let me tell you about the format of the program before we get started. First, I will give our panelists an opportunity to share their thoughts on achieving five nines of VNF reliability on a telco-grade OpenStack cloud. After that, I'll be asking them a few questions, and at the same time we'll be accepting questions from the audience. So if you'd like to use your opportunity to ask them questions, please start thinking about them, and as soon as you're ready, just line up at the microphones on both sides and we'll try to get you in as soon as possible. So with that, Kandan, if you could please share your thoughts on this topic.

So the services exposed by the telcos, they are exposed to millions of users.
For example, the phones we use today, they're all services provided by a telco. And behind those services, what it used to be is a physical network function: routers, customer edge devices, all sorts of networking devices, which are primarily physical devices deployed in the telco networks. This is how, for decades, these ran as physical network functions, mostly supplied by different vendors. And the thing is, they are customized at every layer. You can see in this chart that physical network functions are built on specific hardware and purpose-built software, so each one is pretty much tuned for a particular application and the nature of the functionality provided by that particular device. But today, what is happening in the industry? I think you have been seeing at this summit that there's a lot of discussion going on about turning all these physical network functions into virtual network functions. So what is happening is that people are just converting all these firewalls, routers and other networking devices into virtual functions. What they are forgetting is that these are going to run on a cloud, for example an OpenStack cloud, and the cloud has its own way of providing resources and its own operational methodology for the workloads that run on it. So things are not exactly the same when they move from a physical network function to a virtual network function. From the design stage, the VNF has to really consider that the workload is going to run on the cloud, not on dedicated physical hardware anymore. That is pretty much the key. And we see in the industry that AT&T is pioneering in bringing this transformation into the industry.
And we are seeing a lot of challenges: people are just taking the physical function and converting it into a virtual function, and it is not meant or designed to run purely on the cloud. So that's what we want to talk about in this panel: the challenges we see, and what has to evolve in the industry to really support this functionality in the virtual network function. Can you go to the next slide, please? So one thing we've been doing in AT&T is measuring the availability of this OpenStack cloud. You may have seen that we presented a keynote where we talked about how large-scale we deployed OpenStack in the AT&T platform, and we measured how much availability an OpenStack region actually provides. Availability is pretty critical, because all the functionality and services we provide out of these network functions are very critical: we don't want the cell phone to go down, and we don't want the internet to go down, right? So the availability of the cloud and the availability of the VNF are very, very critical when it comes to providing services. It is very critical that all the functionality of the cloud is reliable and that the functionality deployed by the VNF is also highly reliable. The one thing we see in the industry is that an OpenStack region today, meaning Nova, Neutron and the rest deployed as a region, is about three nines, which is about 8.76 hours of unplanned downtime per year. This does not include any planned downtime, for example what we take for upgrades, maintenance and so on. The platform, which is the cloud platform, is only providing three nines. On top of this three nines, the VNF has to build its availability, right? It has to provide five nines, six nines and above.
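The downtime figures Kandan quotes follow directly from the nines. As a quick sketch of the arithmetic (matching his 8.76 hours for three nines; planned maintenance excluded, as he notes):

```python
# Convert an availability figure ("nines") into unplanned downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(availability):
    return (1 - availability) * HOURS_PER_YEAR

for label, a in [("three nines", 0.999), ("five nines", 0.99999)]:
    print(f"{label}: {downtime_hours_per_year(a):.2f} h/year")
# three nines works out to 8.76 hours per year;
# five nines leaves only about 5.3 minutes per year.
```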
So that really means they have to build availability beyond what the platform can really support. This slide shows that if a VNF is transformed, meaning a physical firewall that is one physical box is turned into a single VM in the cloud, it will not reach five nines; it will be less than three nines. You can see at the bottom of the chart the numbers that show how the deployment model of the VNF determines how much availability it can reach. What we see is that if a VNF is deployed as a single VM, or even multiple VMs within a single OpenStack region, it's about three nines of availability, which is about 8.76 hours of downtime per year, right? Then, if there are two instances of OpenStack within the same data center, we could have the VNF split across the two regions, and that would definitely increase availability to four nines, because you have two instances of an OpenStack region and the VMs are split across multiple servers, so a physical server failure is accounted for. So it would reach about four nines. The true VNF, what we call a cloud-aware VNF, has to be in a minimum of two locations, and ideally two to four locations; that is pretty much the key. The reason it is two to four locations is that that's when it will really reach five nines of availability, given the platform is only providing three nines.
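If region failures were fully independent, splitting a VNF across regions would compound availability as sketched below. This is a naive model: in practice, correlated failures and imperfect failover keep the real figure lower, which is why the chart credits two regions with about four nines rather than the theoretical six.

```python
def combined_availability(per_region, n_regions):
    """Availability of a VNF spread across n regions, assuming failures
    are independent: the service is down only when every region is down."""
    return 1 - (1 - per_region) ** n_regions

# One three-nines region vs. a VNF split across two such regions.
single = combined_availability(0.999, 1)   # 0.999
dual   = combined_availability(0.999, 2)   # approaches 0.999999 in theory
```

The gap between this idealized number and the four nines on the slide is exactly the argument for also fixing the platform: shared failure modes (upgrades, control-plane outages) hit both regions at once.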
So this is what we are learning through the industry and what we are experiencing, and there have been a lot of discussions about this. Pretty much, the community has to not only concentrate on many projects and much functionality, but also consider that the availability and stability of the platform, of OpenStack, are very, very critical. Next slide, please. So we think, from the AT&T perspective, that these are the key areas where we are willing to work with the community to enhance OpenStack. For example, hitless upgrade: this is one key area, really needed to reduce the overall downtime of upgrading OpenStack from one version to another. Then policy-driven live and offline migration: this is also very important. Given that the VNFs trying to achieve high availability and high performance are using SR-IOV, huge pages and CPU pinning, they are pretty much locked down to a physical server. Until VNFs truly become highly available across multiple locations, the platform very much needs to support live and offline migration. There is already such functionality, but it has to be enhanced to support SR-IOV, CPU pinning and huge pages, the new technologies VNFs use to increase their performance. Also, OpenStack has to adopt multi-location awareness. This is very much needed because we expect the VNF to be deployed in multiple locations, but in order to do that, the cloud has to support allowing the VNF to deploy across multiple locations. That is why we'd like to see multi-location awareness and workload placement in OpenStack. Then the Rally testing framework:
OpenStack Rally is really awesome; it has a lot of test cases that can be used, and we use it for scale testing and performance testing. But there are two key areas missing in OpenStack Rally. One is resiliency testing: if one of the OpenStack controllers goes down, how do I simulate that and test it before it happens in production, to see how things will react? That has to be added to Rally. The other thing: I've been in multiple sessions, and one thing I see is that people expect the OSS/BSS, especially the monitoring aspect, to be outside OpenStack. I think something has to happen on the healing side: for example, if OpenStack is deployed with three controllers and one controller goes down, there needs to be a way of auto-healing back to three controllers. So these are the areas where we really see that enhancement has to happen. But it's not only OpenStack; we'd also like to see the VNF evolve. Especially, the VNF has to be distributed both locally and globally, the way I was showing on the slide: to truly achieve five nines and above, it has to be deployed in multiple locations. That is very much the key, and that's why we're saying these are not just requirements and recommendations for OpenStack but also for the VNF. Another key factor is that the VNF has to use the functionality exposed by OpenStack. For example, you can use anti-affinity rules when placing VMs so that they are spread across multiple servers. That is pretty much the key: it is not just about OpenStack exposing the functionality and the APIs, but the VNF
has to actually use it; only then will they achieve the high availability. Thank you.

So the next slide is kind of a double-click down into the data center, or the cloud paradigm that Kandan was describing, and what we're looking at here, in the highlighted section, is the platform. In this context we're talking about the compute, the network, the storage, but also the OS and the hypervisor. To reach the reliability Kandan was talking about, the cloud and the infrastructure need to be more intelligent, and need to understand and correlate the physical and the virtual aspects of that cloud. At Intel, we're looking at platform service assurance as an essential part of giving that intelligence to the cloud and to the data center. When we talk about platform service assurance, we talk about it in three main terms: provisioning, monitoring and telemetry, and then action. From a provisioning perspective, most people jump to enhanced platform awareness and using the intelligence of the infrastructure to place the VNF in the most adequate location. But what we also need to think about at that time is: is that location giving me the correct information I need to ensure the service can be provided, and provided over the life cycle of the VNF and the function it's supplying? Then we look at monitoring and telemetry: what in the infrastructure do I need? If I'm going to use an accelerator, a specific CPU or a NIC, what information do I need off that infrastructure that tells me I'm hitting my SLA, or that I'm going to have a service degradation? Within monitoring, what we're looking to do is get an open set of APIs, both northbound towards the VNF and towards the element management system, and east-west towards the VIM, so we have a common data set and a common understanding: when the VIM is reporting infrastructure and other service degradation and faults, and the VNF is reporting through its element management system,
there is a way to correlate both the physical and the virtual elements into a common fault, so we don't have alarms all over the system. Then, from an alerting or action perspective: do we push everything up northbound into the MANO layer and let it make the decisions? Where do we have to fail fast? If we have migrated a VNF, or if we're seeing service degradation, can we do more intelligence lower down, like thresholding? If we're seeing issues within DPDK or an SR-IOV interface, or common error failures in DRAM, can we make intelligent decisions at the platform level quickly, rather than having to push everything up into a decision-making process and then move it, where we'll have to make a decision before we've taken action? So if we move to the next slide: some of the things we're looking to do, as I already spoke about, are an open API, so we have a set of data, a bus that everything can subscribe to, and also automating that provisioning and monitoring through Ceilometer, Heat, Ironic and EPA, the enhanced platform awareness. So when we're placing that VNF, we have that monitoring and telemetry baked into our thinking at day zero, not day zero plus n. And then again the Nova scheduler, so that we're thinking of this the whole way through the process.

Right, so to further build upon the points that Kandan made: in some cases the provider's services, for instance a mobile user consuming content on his device or making a call, or an enterprise connecting remote branches with the main office using layer 3 VPNs, these types of services are not confined to one VNF or one site or one OpenStack region. They often span multiple VNFs, possibly service-chained together, across different geographic locations and different OpenStack regions. And you want the service to stay reliable and always on, regardless of possible failures in some hardware or in portions of the platform. So the end goal is
a reliable, always-on service, enabled by a platform that has end-to-end monitoring, fault detection and alarming, and is able to take actions to provide a self-healing service end to end. This is sort of, if you will, nirvana for service providers, right? Without any intervention, your service works even though you have individual hardware failures or software failures. So how do we achieve a service like this when you have a fairly unreliable cloud infrastructure underneath, without having to overbuild everything? Well, if you look at the MANO stack, you will see service orchestration and network orchestration. These two components need to be able to make service decisions dynamically, not only at the point of deploying the service but through the life cycle of that service, so when there are dynamic changes in the network and your platform, you react to them dynamically as well. How do they achieve it? They get the data and information they make decisions on from things like the service assurance and analytics part of the MANO stack. Those, in turn, get their data from the interfaces they have to the VIM, to network controllers, and possibly even to the cloud-aware applications, once those are introduced to this environment. So you gather the information and correlate it from the hardware on the individual compute nodes, the things Wayne pointed out, being aware of possible hardware failures or just degradation in the service of a particular node, and you correlate it up to the hypervisor, and further up to the instances running on that hypervisor, meaning the VNFs or the VNF components. But then you take it a step further and correlate it to the service itself, where you have different compute nodes across different sites and different regions. Basically, you tag everything your service touches as one thing that you're looking at, and you look at the health of the whole service as opposed to just the
health of individual components. So you know, when something fails, whether it's important and you need to migrate or take action, or whether it's something that can be taken care of later or just completely ignored, the way they do it in, say, Google or Facebook data centers. Just to make sure you have that end-to-end service view, you can make your decisions and treat your service as one correlated event, if you will. Basically, this way you can ensure that the end-to-end SLA is met without having to worry about meeting an SLA on each individual component, or even thinking of individual components as something that is important. Next slide, please. To be able to do that, and that's something OpenStack is working on right now through projects like Ceilometer and Monasca and a few others, things that allow you to collect monitoring data, do reporting, fault detection and alarming, you need to be able to bring them all together, across different regions and geographic locations, into one pane of glass. So you need tools to measure, monitor and report end to end from the customer, if you're talking about a call, from the person who makes the call to the person who takes the call, and make sure that it is all visible to the service provider. Thank you.

So, following up one of the previous points on smart workload management, it's important to take into consideration that the service's availability is a fundamental part of achieving five-nines reliability. On this slide we are going to show a basic workflow where we take into consideration compute node HA, locally, which means within the same data center, or globally, across multiple data centers. This will be a general overview. Basically, the prerequisite in the local environment is that the compute nodes need to have shared storage. Also, excuse me, in the disaster workflow, what is important is the detection of the disaster itself. This can be done by monitoring the hardware, the
compute nodes and the hypervisors. Once the disaster is detected, it is possible to evacuate the compute nodes: evacuate all the workload, which is most likely the virtual machines with all the related metadata, to other compute nodes. At that point the users can connect to their services on the other compute node, and the service provided by the compute node keeps being available. One of the risks of this strategy is the fencing: if in the meantime we didn't fence the compute node that failed, we can say bye-bye to the five-nines reliability. On the next slide we show a similar workflow for compute node HA across multiple data centers, which is quite similar. The difference is that in the operational workflow we need to know beforehand that the floating IPs managed by the compute node need to be managed and announced with a routing protocol, which can be OSPF or BGP: OSPF is better if there is an internal link between the data centers, and BGP is better if the link is through the internet. In the disaster workflow, basically, when the disaster is detected, with the same principle described before, the compute node is evacuated to the compute nodes in the other data center, and the floating IPs of the new VMs need to be re-announced with the related routing protocol, because otherwise there will be no route available to reach the services provided by the compute nodes. So this is a fundamental part: the floating IPs are announced via BGP or OSPF so the users can connect again, and all the traffic is redirected from the old data center to the new data center. A nice thing about having multiple data centers is that we avoid concentrating the workload of the failed data center on only one other data center, so we can spread the load and the service will be better. The risk here is that if we do not fence the compute node, or the network, or the
data center, then when the service comes back up, bye-bye to the five nines. Next slide, please. So we need tooling to solve this; something is being done in OpenStack, but we need to do more, and we need everybody to provide feedback and insight and share knowledge on this. And this, basically, is it. Thanks.

Thank you all. If you have any questions, please come to the microphone and ask the panelists, and while you are thinking about your questions, let me ask them my question: what changes would you like to see to make the VNFs more cloud-aware? Let's start with Kandan. So it goes back to my presentation; it really comes down to the mantra of two things. The first is the VNF achieving high availability from the design. That is a very important factor: it is not at deployment time that you think about high reliability or availability, but from the design. That's one key item. The second item is that the platform has to be highly reliable and available, and it is very important that the features and functionality we are listing are incorporated into OpenStack, so that the platform can actually support high availability for the VNFs that are going to run on it. I suppose, from an infrastructure point of view, coming at it from a silicon vendor perspective, our role is to make sure that the cloud is more intelligent, that in every place it can take advantage of everything it's supposed to, and that the correct information is being fed forward so the service can be made more available. I think that's our bit. Well, if you look at the workloads themselves, the telco applications right now are very stateful. So one of the things that can be changed is to look at them and see how they can be made stateless, so they are not as dependent on being up all the time; you can lose an instance of the application without losing the application itself. If you look at huge environments,
that is where the biggest challenges are addressed and the biggest problems are solved, and where the technology really moves forward. It's not uncommon to see, in big, highly scalable environments, multiple OpenStack distributions being part of the same cloud platform, or the same company's assets. It is truly important that the vendors of OpenStack distributions make every effort to work together to provide and achieve a common framework for five-nines reliability, rather than each one pushing its own proprietary solution that can be integrated very little with the others. So I think the next challenge is really to work together and provide the community a common framework, by being a little more open rather than each going with its own vendor-driven solution. OK, thanks. Thanks, everybody.

So we have some questions from the audience; we'll take this one first. I think one comment, if I understand correctly, is that when you do service assurance, you look at the service level rather than at the individual components. The question is how you address the component level: you are building in some HA, so say some components fail but it still doesn't impact your service; you don't want to wait until all the components fail before you address it. So how do you balance looking at the service level and, more proactively, looking at the component level? It is a very good question. When it comes to service availability, there are going to be multiple components to it. For example, there could be hundreds of VMs making up a service; in the telco world it may be ten thousand VMs in a service, and maybe not all VMs and VM apps are on the critical path of providing that particular service. Those VMs don't have to be addressed at five nines; maybe less than five nines, like four nines or
three nines. But the VMs and VM apps that have to support the service have to be highly available, and that has to be built in from the design and deployed in the way I was showing with the data centers, to really achieve that high availability. So it is not a single answer for all the VM apps and VMs; it really goes by the particular service and application and how it has to be deployed. When the service is being modeled, you have to look at all the components going into the service, then see how to balance which VNF really needs higher availability to achieve the service that is going to be offered to internal or external customers. Do you want to add? Sure. I also think they're sort of independent: you want to make sure your platform is working, because you want to run services, and then you want to make sure your service is running. So you have a view of your service, and you look at the individual components of the service and see which ones might be in potential trouble, but that trouble is determined by the state of your platform. So your service and the individual components of the service, each application, should be able to maintain their availability flexibly: it's not only one VM performing a particular task, but multiple VMs that can take over from each other, so if one of them goes away, it's not detrimental to the service itself. That's how you're assured on the VM level. Then, on the platform level, you can migrate a whole host if you see that one of your hosts, for instance, is about to fail. So, being able to proactively react to different events on the service level and on the platform level, and also correlate them with each other: if you see you need to migrate a particular host, a notification goes out that the VMs running on this host might be going away for a little while, so don't schedule tasks on those VMs, because you have other VMs that are
doing the same functionality and can take care of those tasks right now. So it's a cooperative effort between all the different components. Okay, thanks. Okay, the next question. Sorry, to provide also an answer to the previous question: it is also important to understand that even if all the tooling is available for detection and correction and so on, it's not likely that we will have completely self-healing, self-resilient services and code. If that were the case, I bet that at least 70% of the people who are here wouldn't be here in the first place. But it's important to also have an operational side, managing operations, and I really mean it in an operational way; that is a part we are not going to escape from in the real world. Really. Go ahead, please.

Question here to the panel: how relevant is this five nines, really, in the community of service providers? We hear different things: if I go to China Mobile, they will say, who the hell cares about it, four nines is enough for me. If you go to Google, they say, forget it, we don't need all that, with services coming up and disaggregation occurring in the gateways and all that. What do you see as the relevance of this, and why do you think that service assurance is so important? It is very important: when I take my cell phone and call 9-1-1, I expect it to work, right? So it really depends upon what services are provided by the VNF, and it varies from company to company. If they offer a particular service, they would say, OK, I'm going to satisfy this particular level of service assurance, and they have to stick to their numbers. So depending upon what the services are, a specific availability number applies. But what we see in the telco world, especially in the U.S., is
that most of the services offered by the telcos need to be highly reliable, and the VNFs themselves have to be more than five nines to support those services. And you talked about microservices; it's a good point, and it is one of the enablers to achieve this: the separation of the functionality of a VNF into multiple VMs. We have seen that usually people convert a physical firewall, or a customer edge router, into a single VM, and microservices would allow splitting the functionality across multiple VMs, so that you don't have them falling down and creating an outage for the application. It is a good question; hopefully that answered it. I suppose, just to hit the question in a slightly different way: if we look at the move towards virtual functions on infrastructure, if we mimicked the way we did it of old, we would just do a single VNF instance on one piece of hardware and not do multi-tenancy, and then we start to lose the TCO model we've all been working on over the last few years, which is why we're moving here: so we can now land multiple services on a single infrastructure. To get away from the five-nines nomenclature and talk more in the assurance terms you mentioned: how can I assure that service when I have several services on the same infrastructure? And that's back to what everyone has mentioned here: being able to identify the utilization by each virtual function on the infrastructure, and being able to tell if something is taking an unfair share of it. If we use a basic CPE model where we're delivering services to an enterprise, if I'm delivering a firewall and a router service on the same infrastructure, and they're both using different areas of cache, and someone is being disruptive, whom am I migrating, or do I migrate all of it? So it's back to, I think, what we're doing at that service assurance level, underpinning the five-nines
reliability. And I think it was a really good question. To that end, can I go to the next question? First, let's... please go ahead. Thank you. So my question is more of a tactical question. It seems that sometimes we can be in a race condition: there can always be enhancements to reliability, but should that keep us from deploying? Perhaps not. What would be the best strategy to address the availability question in the interim? Would it be to push the application developers to expect failures in the infrastructure, or is it to rely on partners to develop custom code, glue if you will, around the OpenStack components? Well, not to say anything bad about people who develop telco applications, and some of them are sitting on this panel, well, I guess Ericsson, but right now it's more realistic that the platform is going to be made more reliable to accommodate the pet applications that are running on it. Our hope at Red Hat, for instance, is that we can work together with the application developers, with the telco services developers, in changing their mindset and how they approach the applications, so they can be platform-aware, so they can be self-healing. It's much easier to later on pull back and make the platform less reliable than, if the applications don't meet the expectation of being cloud-aware, to build up your platform, putting crutches under it if you will, to make it more reliable. So right now, yes, we're working on creating HA for instances, for compute nodes, for controllers, for everything, because we want the platform to be as reliable as we can make it. Next question, please. Thank you. Scott Fulton from thenewstack.io. VNFs require five nines of reliability. The contributions that you and your organizations expect to make to OpenStack: will those contributions make five nines available to everyone in OpenStack, or do you foresee that there will be classes or categories of users, with telcos being on the five nines,
and then a group on four nines, and a group on three nines, who will use a different platform and expect less from it?

It is a very good question. Whatever is contributed by the telco community back into OpenStack is for everyone; OpenStack is for all. Once a capability, a resiliency feature for example, is in the platform, it is up to the user, up to the person doing the deployment, to decide how to use it. There could be hundreds of APIs, or thousands; that does not mean all of them have to be used by everyone, or are applicable to everyone. Users will make use of what they need. The intention of the telco world is to get all these nice features introduced into the platform, irrespective of who the user is. You get a stable platform, and then it is up to the enterprise to decide how they want to use it.

I just wanted to agree with Kandan and thank him for that, because at Red Hat that is our philosophy. We do not want to fork OpenStack: one for enterprises, one for carriers, one for somebody who wants to play around with it. No, we want to have all the features in it, all the resiliency and availability, everything, and then it is up to you.

Well, that is all the time we have. One more question? OK, go ahead; until I am told to leave, go ahead.

In the telco world, a VNF is normally not just an application that you can spin up with an API call. Say you spin up a router or a firewall: you configure it specifically for your end customer, for example the firewall rules or the VRFs. So if the VM goes down and you want five nines of availability, you cannot just spin up another instance of the VM somewhere and expect it to work, because you need to copy the exact same configuration to the new VM. That means you need to do regular backups and keep the history of all the services somewhere. Where in this architecture would you do that?

It is a very good question. It depends on how the VNF is deployed. The way I was showing in the slide, you pre-deploy in five locations, for example, or three, depending on how much availability is needed; that is how the availability has to be achieved. The question is whether you deploy during the disaster or pre-deploy, and in most cases it is pre-deployed, because you do not want to pay the creation time: creating an instance itself takes time, maybe two or three minutes, and three minutes would break a five-nines or six-nines availability budget. And provisioning a VNF is not just creating the VM; you also have to provision the configuration inside it. Tacker, for example, is a nice project coming up in the OpenStack program that lets you not only create the VM but also provision it with the configuration it needs. So it is really the orchestration, the platform, and the overall service working together to deliver the holistic service provided by that VNF. Yes, good question.

I think those are two different but quite related problems. In OpenStack we need an architecture-driven, segmented approach, where we separate the problems from one another and provide a solution to address each segment. In my example I mentioned evacuation: evacuation, assuming you have shared storage, addresses that specific case. Is backup needed? Absolutely, yes, that is one of the main parts. But we try to take an architecture-driven, segmented approach to address specific issues; otherwise it becomes really difficult. Thank you.

OK, I think that is all the time we had. Thank you very much. I appreciate the panelists' time and participation, and thank you all
for listening as well. Thanks.
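The pre-deployment argument in the last answer rests on simple availability arithmetic: a five-nines service allows only a few minutes of downtime per year, so a two-to-three-minute on-demand respin is unaffordable per failure, while a few pre-deployed independent replicas multiply failure probabilities down. A minimal Python sketch of that arithmetic (illustrative numbers chosen by the editor, not figures quoted by the panel):

```python
# Back-of-the-envelope availability math behind the panel's points:
# why a multi-minute re-instantiation breaks a five-nines budget, and
# why pre-deployed redundant VNF instances recover it.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525960.0

def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability

def parallel_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent instances where any one
    surviving instance keeps the service up: 1 - P(all replicas fail)."""
    return 1 - (1 - single) ** replicas

# Five nines leaves roughly 5.26 minutes of downtime per year, so a
# single failure with a 3-minute respin consumes most of the budget.
print(f"Five-nines budget: {downtime_budget_minutes(5):.2f} min/year")

# Three pre-deployed instances of only 99.9% availability each already
# exceed five nines, assuming independent failures.
print(f"3 x 99.9% replicas: {parallel_availability(0.999, 3):.9f}")
```

This also makes the earlier microservices point concrete: splitting a VNF so that any replica can carry the service turns modest per-instance availability into better-than-five-nines service availability, provided the replicas fail independently (e.g., they sit in different locations, as in the pre-deployment example).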