Hello everybody, good morning. I am Shamik Mishra from Aricent, and today I will be presenting some thoughts about NFV orchestration and the challenges we have recently faced in telecom deployments. I am quite new to OpenStack and this is my first OpenStack Summit. I had actually listed a lot of challenges, and some of them got solved during the summit itself, so I found there are solutions already existing. Some of the problems I mention you might already have heard about during the summit, but nevertheless let us go ahead.

My main focus area is management and orchestration of network function virtualization elements. Currently I am working on identifying the requirements for the NFV MANO architecture, trying to specify an extensible and modular architecture for the NFV service orchestrator, and, in the process, trying to define what we may need to change in the underlying virtual infrastructure manager, which in our case is OpenStack. Today we will cover some of the implementation challenges we are currently facing and then go deeper into one of them: end-to-end service monitoring of NFV and how it relates to the eventual orchestration of the NFV elements. I will introduce a model we call a service creation model, discuss a service-aware monitoring system, and show how we can link the monitoring system and its KPIs with the orchestrator and the overall orchestration of the system. Finally, I will speak a bit about a proof of concept we did for a wireless Cloud-RAN use case: how we deployed and orchestrated it on OpenStack, what the results were, and what challenges we faced. If we have time, there are certain areas where we think the community needs to come forward and contribute in order to get a really efficient orchestration system for NFV, so we can look at what those contributions could be.

A quick recap of network function virtualization: it is an ETSI-defined standard. In traditional systems we have many separate network elements, such as routers, firewalls, the packet gateway, the LTE packet core, load balancers, applications, distribution switches and the IMS, all running on dedicated hardware as separate appliances. Because these functions run on dedicated hardware, a lot of computing capability goes unused: the hardware is designed for peak capacity rather than optimal usage, and there is not a peak at every point in time. There may be certain times when we hit peak capacity, but for the rest of the time the hardware and computing capability of these elements is often wasted. They all have separate management systems and, more importantly, separate ways of being monitored, and it is also very difficult to introduce services rapidly. Moving towards NFV, we try to host all of these applications on commodity hardware through virtualization, and we use the characteristics of cloud, for example scalability, elasticity and efficient use of resources, to get more efficient, more dynamic, more flexible and more scalable NFV deployments.
It is of course cost effective, but this is just the tip of the iceberg; there are a lot of challenges that go with it. I am not sure whether you are able to read all of this, but we took a survey from Heavy Reading of different NFV clients, done around 2014, and we realized that there are several impediments to commercializing NFV. Some of them are reliability and hardware, of course, and perhaps the most important one, which we are going to discuss today, is how to troubleshoot NFV components and assure their service delivery. So we will discuss these two items, the cloud orchestration part and a very specific area of troubleshooting and service assurance, not the entire cloud architecture as such, but focused on the applications that are going to be hosted on OpenStack. Ultimately, network function virtualization is not just bringing your legacy software to the cloud; it is much more than that, and perhaps the biggest impediment is ensuring that the SLAs an operator agrees to are actually enforced downwards onto the cloud infrastructure, in conjunction with the virtual network functions.

At a very high level, and we will go into depth on one of them, the challenge is how to realize large-scale NFV deployments. You might have hundreds of service chains and many functions hosted on the cloud: how do you monitor them, how do you replicate them? For example, if a new tenant comes, how quickly can we replicate a deployment; if we have optimized a network for one tenant, can we reuse that for subsequent deployments; can we replicate a complete working deployment from one cloud to another? These are a few of the many challenges we have in NFV. There is also the question of simplification: we are moving towards NFV because we want to simplify and automate service delivery, so if we do not simplify the overall network architecture, we will most probably end up with problems similar to those we have on the legacy side. Moreover, the end-to-end network need not be entirely virtualized; certain components might run on customer premises and can actually be legacy hardware, so if we plan to optimize one part of the network, we cannot just ignore the legacy side. If we have to optimize and make automated decisions, we have to do it for the entire network, not just the parts hosted on cloud. Also, telecom applications have strong dependencies on each other when they are chained: if we scale up one element in the chain, we most probably need to touch or modify the neighboring elements of that scaled element. So overall, we need to take a lot of decisions: how to optimize the system, how to replicate it, how to create larger networks, how to simplify the architecture and the delivery. All of this is driven by one very important aspect, which is collecting monitoring data and taking decisions based on it. So let me speak a bit about end-to-end service monitoring of NFV; this is a kind of model that we are starting to think might work.
We have a service request; it could be a service chain in itself, or a collection of services that define an overall service, and we would like to transform this service into resource requests. Why do we need to do that? Each service request carries its own KPIs, and those KPIs might be quite abstract in nature: "I need my service delivered within a certain time frame." How does that really matter to, or impact, the resources that get hosted on the cloud? We need to transform those KPIs into requirements for the resources that are going to be deployed: for example, the number of virtual CPUs, the amount of memory, the networking capacity, whether we need rate limiting, whether we need guaranteed bandwidth. So based on the service requirement we transform it into resource requests, and the KPI gets translated into a KPI that is more relevant to the resource request. Once those resource requests are made, we of course deploy them, and then we start monitoring them, because we need to dynamically decide whether our KPIs are being met or not and, if not, what actions to take. We need to generate the monitoring data, collect it, aggregate it and analyze it; most probably we also need to visualize it in some way, and then, based on our analysis, take actions on the specific measurements we are capturing.

Take a very simple example: a firewall, a load balancer, a switch, a hypervisor, some network functions. They are all different elements; some could be open source, some vendor delivered. All of them have completely different ways of managing themselves, and possibly completely different ways of generating logs and monitoring data. However, if we have to take decisions on the health of the system or the health of the service, we need an instantaneous way of collecting logs and monitoring data and acting on them. On top of that, the clouds may be interconnected over VPN, and there could be SDN controllers that are not part of the virtual infrastructure manager as such: a controller that defines the different service chains, the traffic steering and the traffic limiting. How do we ensure that the performance of that controller is also taken into account when taking decisions on scaling, elasticity and resource optimization? Most probably we have to think of a unified monitoring model for NFV. This might sound a little disruptive, because we do not have one now: can we have a uniform way of collecting monitoring data, a common protocol, that nearly every vendor and every open source component can follow? Then perhaps we can get to a situation where we unify the monitoring collection and actually use it to take decisions, particularly decisions for orchestration. This is also needed by the OSS, so the standardization perhaps needs to go right up to the OSS: the current OSS may not understand everything that gets generated from the cloud; a legacy OSS, for example, may not know what a VM death is, why it matters, or how it impacts the overall service.
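To make the idea of a uniform collection format concrete, here is a minimal sketch, in Python, of what a vendor-neutral monitoring record and collector could look like; the schema and field names are assumptions I am making for illustration, not anything that is standardized today.

```python
# A minimal sketch of a vendor-neutral monitoring record. The field names are
# hypothetical; nothing here is a standardized schema today.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class MonitoringRecord:
    source_id: str          # e.g. "fw-01", "lb-02", a hypervisor, an SDN controller
    source_type: str        # "vnf" | "hypervisor" | "switch" | "sdn-controller"
    state: str              # lifecycle state reported by the element, e.g. "run"
    metrics: Dict[str, float] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class UnifiedCollector:
    """Collects records from heterogeneous elements in one common shape, so the
    orchestrator (and ultimately the OSS) can reason over them uniformly."""

    def __init__(self) -> None:
        self.records: List[MonitoringRecord] = []

    def ingest(self, record: MonitoringRecord) -> None:
        self.records.append(record)

    def latest_by_source(self) -> Dict[str, MonitoringRecord]:
        latest: Dict[str, MonitoringRecord] = {}
        for r in sorted(self.records, key=lambda r: r.timestamp):
            latest[r.source_id] = r
        return latest


# Usage: a firewall VNF and a hypervisor report through the same interface.
collector = UnifiedCollector()
collector.ingest(MonitoringRecord("fw-01", "vnf", "run", {"pkt_drop_rate": 0.02}))
collector.ingest(MonitoringRecord("compute-3", "hypervisor", "run", {"cpu_util": 0.71}))
```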
So there must be a way to escalate issues right up to the OSS and, if needed, take actions all the way from the virtual infrastructure manager, that is OpenStack, up to the orchestrator. There must be a way to take decisions uniformly and escalate issues uniformly to the higher layers, and in order to have that uniformity we may need to collect things in a uniform way.

Another example we might want to look at: suppose we have a service chain where the traffic is steered through A, B, C and D, so this is a traffic path, and for some reason we need to scale up B. We use a load balancer or some other means to scale up B, and the new traffic starts getting steered through the scaled node. Of course it will not just work like that; we probably need to modify or reconfigure A and C as well. And once this new node is in action and traffic is being steered through it, we might also reach a situation where the subsequent subgraph gets overloaded. So the decision-making is not just for that particular node: when you take a decision for a node, you also need to consider the subsequent nodes, and we face this problem quite frequently. We cannot take decisions for a service in isolation; if we want to take a decision for one component, we have to consider the impact on the entire chain. For example, we could decide that instead of scaling just that node, we scale the entire subgraph of that node, and in that way limit the modifications required on either side of the scaled node. However, this model has a pitfall: it creates scaled nodes for C and D as well, so there could be some wastage of resources; we might be unnecessarily creating virtual machines hosting C and D that were not really required. In either case there will be situations where we need to decide whether to use a load balancer, whether to scale the entire subgraph, or whether to scale one of the nodes. As you can see, these decisions cannot be taken in isolation, so we need a way to instantaneously determine the situation of the entire graph at that point in time; the orchestrator, or whoever takes the scaling decision, has to take the impact on the entire service into account.
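Just to illustrate why the decision cannot be taken node by node, here is a toy sketch of that choice; the headroom numbers and the spillover heuristic are invented for the example, not how any real orchestrator decides.

```python
# Toy illustration only: decide whether to scale a single node or the whole
# downstream subgraph of a chain A -> B -> C -> D. Thresholds and the cost
# model are assumptions made up for this example.
from typing import Dict, List, Tuple


def plan_scale_out(chain: List[str],
                   load: Dict[str, float],     # current load per node, 0..1
                   hot_node: str,
                   spillover: float = 0.3) -> Tuple[str, List[str]]:
    """Return (action, affected_nodes).
    If scaling only the hot node would push its downstream neighbors past
    capacity once the extra traffic flows through, scale the whole subgraph."""
    i = chain.index(hot_node)
    downstream = chain[i + 1:]
    # Estimate downstream load once the hot node stops being the bottleneck.
    overloaded_downstream = [n for n in downstream if load[n] + spillover > 1.0]
    if overloaded_downstream:
        # Scaling the subgraph avoids repeated reconfiguration, at the cost of
        # possibly creating instances (e.g. of C and D) that were not needed.
        return "scale_subgraph", [hot_node] + downstream
    # Otherwise scale just the hot node and reconfigure its neighbors.
    neighbors = chain[max(i - 1, 0):i] + chain[i + 1:i + 2]
    return "scale_node", [hot_node] + neighbors


# Example: B is hot; C is already at 0.8, so the subgraph B-C-D gets scaled.
action, nodes = plan_scale_out(["A", "B", "C", "D"],
                               {"A": 0.4, "B": 0.95, "C": 0.8, "D": 0.5}, "B")
print(action, nodes)   # -> scale_subgraph ['B', 'C', 'D']
```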
Going forward, let us also consider that we need different kinds of monitoring for different states. A VNF has certain states, for example install, start, stop, monitor, maintain, run. Say we have A, B, C and D and all of them are running, so they are all in the run state, and we can derive a chain state from the applications and their corresponding states; call that chain state CH1. What we monitor for the entire chain depends on that state. When an element is running, we pick up certain statistics; when an element is not running, or is in maintenance mode, or there is a load-balancing situation where one element is being scaled up, the state of the chain changes, and what you need to monitor when everything is running is different from what you need to monitor when one of the elements is in a different state. We need a state-aware monitoring system for the simple reason that if, say, B is under upgrade, we would focus on a completely different set of parameters: whether the upgrade is successful, whether we need to roll back, whether we need to do something else if the upgrade does not succeed. A completely different set of statistics is required in that state compared to when everything is running. As the state of the chain changes, the state of the monitoring also changes. Eventually we have to build a model that takes into account not just the health of the entire chain but also the individual state of each element, and then aggregates them to decide what monitoring to start, what to collect, how to aggregate it and how to take decisions based on it.

Now, if you recall one of my earlier slides, I said there was a transformation function that transforms a service request into resource requests. Take this case, a somewhat hypothetical example, but it helps: there are two kinds of traffic, one along A-B-C-D and one along A-B-C-E-F, and there are different KPIs at different interfaces. For example, there could be a KPI between A and B, and there could be a KPI for the entire delivery of the service from A to D or from A to F, so there are multiple KPIs in a single service request. What we infer from the earlier discussion is that we have to convert this service request into multiple resource requests. What could those resource requests be? There could be computing resource requests: in order to ensure a given KPI, we might have to translate it into how much CPU and how much memory is needed. There could be networking resource requests, for example if there is a need for guaranteed bandwidth. And there could be placement requests: a certain element may require a node with SR-IOV or PCI passthrough enabled, or it may need a certain hardware accelerator, which is quite common in NFV situations, and the placement request may also depend on the constraints of that particular infrastructure. At the end of it we need to ensure that the KPIs are met, so we have to transform the service request into resource requests accordingly. Eventually it turns out to be something like this: there is a certain traffic flow, so we know a certain steering has to be done, and there is a networking configuration which the SDN controller applies; the A-to-F traffic flow is handled differently. So we now have a service request and a collection of resource requests.
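As a rough sketch of that transformation step, the snippet below turns per-VNF KPIs into compute, network and placement requests; the KPI names and the sizing rules (megabits per vCPU, the SR-IOV trigger) are assumptions of mine, since in practice they would come from the VNF descriptors and the agreed SLAs.

```python
# Minimal sketch of turning service-level KPIs into resource requests.
# The sizing rules are invented for illustration; in practice they would come
# from VNF descriptors and SLAs.
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceKpi:
    vnf: str
    throughput_mbps: float
    max_latency_ms: float


@dataclass
class ResourceRequest:
    vnf: str
    vcpus: int
    memory_mb: int
    guaranteed_bw_mbps: float
    placement_hints: List[str]


def decompose(kpis: List[ServiceKpi]) -> List[ResourceRequest]:
    requests = []
    for kpi in kpis:
        vcpus = max(1, math.ceil(kpi.throughput_mbps / 500))   # assumed: ~500 Mbps per vCPU
        hints = []
        if kpi.max_latency_ms < 1.0:                           # assumed: tight latency needs SR-IOV
            hints.append("sriov")
        requests.append(ResourceRequest(
            vnf=kpi.vnf,
            vcpus=vcpus,
            memory_mb=2048 * vcpus,
            guaranteed_bw_mbps=kpi.throughput_mbps,
            placement_hints=hints,
        ))
    return requests


# Example: KPIs on the A-B interface and on the end-to-end A-D path.
print(decompose([ServiceKpi("B", 1200, 0.5), ServiceKpi("D", 300, 10.0)]))
```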
How does this model actually work? This is just an example: we have a service orchestrator, a controller, infrastructure managers and other networking devices. We get a service request, characterized by its KPIs; from the service request we do the resource decomposition, and then we create, for example, the network chaining request, a kind of networking request which the SDN controller uses to configure the networking services and set up the traffic steering. Then there is a set of computing requests and a set of placement requests; the service orchestrator decomposes these and configures the individual controllers, and those controllers in turn configure the infrastructure. The placement requests could lead to resource requests and some resource reservations, because in NFV we simply cannot rely on best-effort resource allocation at runtime; we have to do some kind of resource reservation, otherwise we may not be able to guarantee the service. Those decisions have to be taken, and they too have to be driven by the KPIs. Then perhaps we start instantiating the virtual network functions one by one in the chain: the resources get instantiated, the VNF gets created, the VNF gets configured, so eventually the entire service gets deployed, and then we start monitoring it. When we say monitoring, there can be multiple points in the resource decomposition model where we need to monitor; for example, we might need to monitor while the resource requests are being deployed, and we have to keep track of how the resource decomposition was done, because we need to link the monitoring back when we take it up to the service orchestrator, right up to service adherence: we first break the request down and then join the measurements back up. This is the model we tried to create; it is a simplified model, and real life will not be so simple. There will be a lot more complexity, many more resource creations and decompositions, but it is just a simplified model to understand whether this will actually work or not.

One important thing is how to actually monitor the end-to-end services. We have decomposed the service request and deployed it; now how do we monitor it? Take a simple example again: the infrastructure monitor can watch the different compute and networking nodes, so Ceilometer or any other telemetry method can monitor the infrastructure as such, but how do we monitor the elements running inside the virtual machines? Most probably we need something there: we have to think of a service-aware monitor which considers not just the network elements that are hosted but also their state, and those elements then have to be included throughout the service monitoring method. The service monitoring would be defined by the service templates, the service level agreements we have made, and the model we have decided on. The question is: do we need to standardize these containers that carry the monitoring data, the statistics generated by the application? I would say it is quite debatable, but most probably, if we are to have an efficient way of monitoring and delivering services on NFV, we need to think in that direction.
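Pulling the flow just described together, here is a skeletal sketch of the orchestrator's steps; every object and method in it (decomposer, sdn_controller, vim, monitor) is a placeholder of my own, not a real OpenStack or ETSI MANO API.

```python
# Skeletal sketch of the deployment flow described above. Every object passed
# in (decomposer, sdn_controller, vim, monitor) is a placeholder standing in
# for real controller/VIM interactions, not an actual OpenStack or MANO API.
def deploy_service(service_request, decomposer, sdn_controller, vim, monitor):
    # 1. Decompose the service request and its KPIs into resource requests.
    resource_requests = decomposer(service_request["kpis"])

    # 2. Networking: the SDN controller sets up chaining / traffic steering.
    sdn_controller.create_chain(service_request["chain"])

    # 3. Placement and reservation: best-effort allocation at runtime is not
    #    enough in NFV, so capacity is reserved before instantiation.
    for request in resource_requests:
        vim.reserve(request)

    # 4. Instantiate and configure the VNFs one by one along the chain.
    vnfs = [vim.instantiate(request) for request in resource_requests]
    for vnf in vnfs:
        vnf.configure()

    # 5. Start monitoring, keeping the link back to the service-level KPIs so
    #    the measurements can later be aggregated back up to service level.
    monitor.start(vnfs, kpis=service_request["kpis"])
    return vnfs
```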
So we created a simplified architecture for a proof of concept. We used the standard OpenStack monitoring methods, Ceilometer, for storage, compute and networking. For the different network functions we hosted, we had access to those functions, so we created agents that generate monitoring data according to a protocol we defined. The protocol actually stacks the monitoring: if the traffic goes through A, B, C, then we stack the monitoring as C, B, A, so the measurements sit one on top of another in the direction the traffic is being steered, and we store the result in a database. Monitoring is then initiated by an aggregator, which is part of the controller system and which starts the monitoring based on the state of the system; if a certain action is being taken on a VNF, it considers that as well. Based on those parameters, the service orchestrator can decide what actions to take in terms of scaling, whether it needs to reconfigure resources, whether it needs to initiate some kind of service healing, or whether it needs to look at the different alarms that get generated. While doing this, we also realized that we can push some of these functions downwards towards OpenStack: the earlier we can take a self-healing action, the better. It could well turn out, when we are done with this, that some of these elements get pushed down towards OpenStack, towards the virtual infrastructure manager, as close as possible to the network function.

We then did a simulation of this model. In the simulation we took service requests consisting of switches, routers, firewalls, load balancers and data-generating VNFs, and our model converted those service requests into resource requests. We implemented service-aware placement, which is basically a linear program with latency, routing and switching capacity as the key constraints. We also used VNF states to start, stop or modify monitoring; we used two states in this simulation, install and run, and we did the service-aware monitoring. The timeline in the results runs from the receipt of the service request to when the first service monitoring is collected, aggregated and reaches the orchestrator, and it turned out to be roughly linear. Of course we need to dig deeper into this. Currently the evaluation time for adding service requests to the model is dominated by the service placement, because it is a linear program solving the constraint problem, but we feel that as the system becomes larger and more and more service requests are added, the optimization perhaps needs to happen in the decomposition and aggregation part. In future we will work more on this and try to see whether this model gives us information on how to improve the decomposition and aggregation parts.
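To give a feel for the service-aware placement step, here is a toy version that assigns a small chain to compute nodes under latency and capacity constraints; the proof of concept formulated this as a linear program, whereas this sketch simply enumerates the assignments, and all the numbers are invented.

```python
# Toy illustration of service-aware placement: assign the VNFs of a chain to
# compute nodes so that end-to-end latency is minimized while respecting node
# capacity. The PoC used a linear program; brute-force enumeration is used
# here only to keep the sketch short. All numbers are invented.
from itertools import product

vnfs = {"fw": 2, "lb": 1, "router": 2}            # vCPUs demanded per VNF
nodes = {"n1": 4, "n2": 2}                        # vCPU capacity per node
latency = {("n1", "n1"): 0.1, ("n1", "n2"): 1.0,  # inter-node latency in ms
           ("n2", "n1"): 1.0, ("n2", "n2"): 0.1}
chain = ["fw", "lb", "router"]                    # traffic order

best, best_cost = None, float("inf")
for assignment in product(nodes, repeat=len(chain)):
    placement = dict(zip(chain, assignment))
    # Capacity constraint: total demand per node must fit.
    used = {n: 0 for n in nodes}
    for vnf, node in placement.items():
        used[node] += vnfs[vnf]
    if any(used[n] > cap for n, cap in nodes.items()):
        continue
    # Objective: sum of latencies along consecutive hops of the chain.
    cost = sum(latency[(placement[a], placement[b])]
               for a, b in zip(chain, chain[1:]))
    if cost < best_cost:
        best, best_cost = placement, cost

print(best, best_cost)   # e.g. {'fw': 'n1', 'lb': 'n1', 'router': 'n2'} 1.1
```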
Let me now conclude with the key takeaways from this project. We were able to decompose service requests into resource requests in a structured manner. We built a VNF state-aware monitoring model in which we could aggregate data. We perhaps need standardization of monitoring models, where all NFV or VNF vendors, or the open source community, come together and decide on a standard way of collecting and exposing monitoring information such as statistics. We also saw the need for service-aware placement; I did not cover that part for lack of time, but the service-aware placement is a linear programming model and is very much part of the overall orchestration model. Finally, we realized that there needs to be modularity in the architecture, in the sense that if a new type of service request is added, it should not disturb the existing service requests; we should be able to add new service request types quite seamlessly.

Now let us come to another use case, not exactly related to what we just discussed, but something we did very recently, so I wanted to share the results. It is a Cloud-RAN use case. Cloud-RAN is a mobile broadband solution for LTE where the base station's layer 1, the baseband pooling or baseband processing, is done at the locations where the antennas are, and the remaining part of the eNodeB software is hosted on cloud. In the proof of concept we did, we used a real-time Linux host and KVM hypervisors. There were two eNodeBs, and in each eNodeB the software functionality was divided into two virtual machines: one containing the L2 part, protocols like MAC, RLC and PDCP, and the other hosting the L3 part, basically the layer 3 telecom functions, the radio resource management and the OAM part. That was the overall system view, and we used OpenStack Juno to realize it. You can see from the diagram that eNodeB 1 and eNodeB 2 are hosted on different compute nodes, and they communicate over external IPs, not internal or private IPs. The virtual EPC, which terminates the S1 interface of the Cloud-RAN solution, is also hosted on a compute node. The antenna side, layer 1, is separate, on the public network; of course this entire proof of concept was done within a lab, so it is as public as it can be. We created the different interfaces and abstracted the internal, private IPs behind the external IPs.

We had some very good results, in fact. We could reach an uplink throughput of 60 Mbps quite easily, and we had a downlink throughput of 120 Mbps. More importantly, the CPU load increased linearly as we moved to higher throughput. That means that even though LTE has a very small TTI, the transmission time interval, of one millisecond, so the scheduling is quite fast and a lot gets scheduled in each TTI, the CPU usage in the virtual machine system was fairly linear and we did not find any bottleneck as such when we just monitored the throughput. We did run into some issues when we added more and more mobile stations,
but those were mainly due to the fact that the real-time scheduling of the UEs within one particular slot was overloading the system, and each virtual machine was a simple one-vCPU, 2 GB memory machine; you can use larger ones and improve the efficiency as well. The more important thing is that this proof of concept was possible, and we could see a very distinct trend in how the CPU was performing. Some of the key observations with Cloud-RAN: there was a lot of packet buffering and loss in the initial stages, primarily because of the one-millisecond transmission time interval, and we solved that by enabling PCI passthrough. The other problem was getting live VM migration to work for the particular VM where the MAC is located; if I just go back, this is the VM where layer 2 is located. Doing a live migration of that particular virtual machine was quite challenging because…