Hello, good afternoon everybody. This is an NFV talk, and the topic is increasing infrastructure efficiency via optimized NFV placement in OpenStack clouds. Our team: I'm Ramki from the Brocade CTO office, and here we have Debo from the Cisco Cloud CTO office, and Yathi from the Cisco Cloud CTO office. Our goal is to drive innovative open source solutions, specifically for NFV with OpenStack. So what is this talk about? You might have heard Toby Ford from AT&T talk about NFV a couple of days back, on Tuesday the 13th. The key message he delivered was this: essentially, the worlds of IT and telco are converging, and for the telco cloud, OpenStack is the infrastructure foundation. With that, our goal is to transform OpenStack into a carrier-grade NFV cloud solution. We'll deep-dive into some high-level gaps that we have identified, and we also plan to demo some initial progress on this front. So with that, the agenda for today: a quick NFV summary, then I will focus on a specific cloud NFV use case, where our goal is to drive some innovative ideas through efficient resource placement strategies. I'll propose some extensions to the OpenStack scheduler to achieve this, and then a quick conclusion. So what is NFV? Network Functions Virtualization is a global movement by network operators: AT&T, Verizon, BT, operators all over the world, not limited to just the United States. The idea is to transform from classic network appliances, say hardware-based firewalls or BRAS, into virtualized network appliances which can run on general-purpose hardware, Intel-based or ARM-based, anything, and drive OPEX and CAPEX savings. And not just that: it's also the increased automation and the use of virtualized appliances which drive OPEX savings and faster time to market. Besides OPEX and CAPEX savings, the operators' goal is also to enable new business models and value-added services.
And as you can see, if you're virtualizing equipment like a BRAS, that means you're also touching something like a telco central office. So it's no longer just a large-scale data center game; in fact, your central office is becoming a small data center. Some operators are also looking at data centers in the mobile base station, really mini data centers. So with NFV, you can see that data centers are stretching well beyond the current definition of large-scale or hyper-scale data centers to distributed data centers like central offices, or data centers at the base station. That's the interesting angle in which things are proceeding with NFV, which is quite different from your classic hyper-scale data centers. Now, let's look at a specific NFV use case: NFV Infrastructure-as-a-Service. The primary motivation is network functions in the cloud. Today, you probably know the classic Infrastructure-as-a-Service quite well, which is offered by cloud service providers and is essentially an opened-up compute and storage model. And there is Network-as-a-Service, offered by network service providers, which is focused on the WAN. The whole goal of combining both with NFV is to leverage, for example, the NFV infrastructure of another service provider to increase resiliency, reduce latency for such use cases, and also address regulatory requirements, especially for energy efficiency. If you look at where we are today with such a service, compute and storage are treated independently of network, and there are no energy efficiency considerations. This means it's just randomly combining them and calling it NFV IaaS, without maximizing the service value, because nothing is joint: it's disparate, but combined under a single name. With that, I wanted to quickly get into what this NaaS is; some of you may already be familiar with it.
So if you look at what NaaS, Network-as-a-Service, essentially is, one of the common use cases is bandwidth on demand across the WAN. If you have workloads like disaster recovery or on-demand backup, then instead of dedicating bandwidth across the WAN, the idea is to dynamically allocate bandwidth when it is really needed and release it later. The big advantage is that, since WAN bandwidth is precious, you can achieve substantial bandwidth savings, especially for elastic jobs like backup or DR. That's the primary benefit, and typically MPLS is used for doing bandwidth on demand. Now, with NFV IaaS, I gave you an overview of how compute and storage are still treated decoupled from network in this model. So let's see where we really want to get to in order to extract the combined value. We want to get somewhere much beyond WAN bandwidth savings, to a model of optimal resource placement across data centers. And like I said, it's not just the large-scale, hyper-scale data centers, but even the distributed, smaller or mini data centers at the edge of the network. The goal is to increase energy efficiency while maintaining multi-tenant fairness and improving performance. This would drive CAPEX and OPEX savings, improve quality of experience, and address regulatory requirements, especially for energy efficiency. The popular use cases are disaster recovery and on-demand backup across the WAN, and virtual CDN, especially content prepositioning, where we want to make sure the content is closer to the user, at the data center which is very close to the edge. Essentially, if you're keeping content in a mobile base station, it's proximal to the user and you can improve the quality of experience. So in this context, specifically for NFV IaaS, let's look at some of the energy efficiency issues today.
So if you look at power usage in data centers, especially as we are transforming to NFV, it's all about servers, and servers are the biggest consumers of energy. And if you look at server power profiles, as you can see here, they're heavily non-linear: around 45% of peak power at 20% of full load, and notably, the biggest issue is around the idle state, where you can see that approximately one-third of peak power is being consumed in active idle. This fundamentally means that it's extremely inefficient to keep servers powered on under low-load conditions. If a server has nothing to do, it's better to just power it off rather than leaving it in active idle, waiting to receive load. This is all clearly depicted here; the numbers I talked about are from SPECpower benchmark results for an HP ProLiant server, so this issue is quite public and well-known. With that, I'll hand it over to Debo to explain how we address this with OpenStack, and the huge opportunity. Thanks, Ramki. So as Ramki has mentioned, with all the NFV use cases, optimized resource placement is getting important. From a very high-level perspective we know the requirements, the things that we need to do as a community within OpenStack. But how do we get there? The way we get there is by taking very small steps. But before telling you what small steps we've taken, consider the following workflow. Say there is an application sitting on top of an NFV solution, and typically the NFV solution will have its own API, which we will not get into because it's at a much higher level compared to OpenStack. If you look at the diagram on your right-hand side, there are a lot of components in the NFV solution. One of the components is an infrastructure virtualization layer, where OpenStack is poised to play a huge role; in fact, it's one of the most commonly depicted blocks.
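The non-linear power profile described above can be captured in a toy model. This is purely illustrative: the idle and peak wattages and the exponent below are made-up numbers, not the SPECpower figures from the slide, but the sketch shows why efficiency collapses at low load.

```python
def server_power_watts(utilization, p_idle=100.0, p_peak=300.0):
    """Estimate server power draw at a given utilization (0.0-1.0).

    Illustrative non-linear model: a large fixed cost is paid just for
    being powered on ("active idle"), so energy efficiency per unit of
    work is poor at low load.
    """
    if utilization < 0.0 or utilization > 1.0:
        raise ValueError("utilization must be in [0, 1]")
    # Idle draw plus a sub-linear ramp toward peak as load increases.
    return p_idle + (p_peak - p_idle) * utilization ** 0.7

# Work-per-watt collapses at low load, which is why consolidating VMs
# and powering off idle servers saves energy.
for u in (0.1, 0.5, 1.0):
    watts = server_power_watts(u)
    print(f"util={u:.0%} power={watts:.0f}W efficiency={u / watts:.4f}")
```

In such a model the active-idle draw dominates at low utilization, which is exactly the inefficiency the talk points at.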
Now imagine an NFV customer submits a job request, say a backup, with some constraints on elasticity windows and other constraints. There could be many constraints: network constraints, business constraints, different rules. And you have to encode these constraints in a reasonable format; obviously, all those questions are still open. But eventually what happens is the request traverses through this stack and goes to the OpenStack API, which would then hand over the request to the scheduler API. And for the scheduler, assume that we have a fancy scheduler called the solver scheduler; we'll talk about it in great detail. What this fancy scheduler needs to do is the following: it needs to figure out what the system state is, and once it figures that out, it makes an intelligent decision on where the resources should be placed, and not only where but even when the resources should be placed, because if you're trying to schedule a backup, it's sometime in the future. Then, once it has decided the exact schedule, it returns back to the NFV business logic in the provider, and the provider responds to the NFV customer. As a side effect, it could also trigger other events. For example, it could trigger an entire workflow that finishes one job and starts the next based on timing, or it could trigger events that power down certain servers, and so on and so forth. But essentially, the simplest thing that we need to do as a community in OpenStack is to ensure that the NFV application stack can actually dictate its requirements to OpenStack in a meaningful way. So we need a very crisp definition of the API, the contract between the NFV layers and the virtual infrastructure layer. That is absent today.
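To make "encoding constraints in a reasonable format" concrete, here is one hypothetical shape such a request could take. None of these field names are a real OpenStack or NFV API; this is only a sketch of the kind of contract the speakers say is missing.

```python
# Hypothetical NFV job request as it might be handed down the stack to
# the scheduler API. Every field name here is illustrative.
backup_job_request = {
    "job_type": "backup",
    "resources": {"vcpus": 8, "ram_mb": 16384, "volume_gb": 500},
    "constraints": {
        # "When" matters as well as "where": the job may run at any
        # time inside this elasticity window.
        "elasticity_window": {"start": "02:00", "end": "06:00"},
        "max_wan_bandwidth_mbps": 1000,
        # Place the service VM close to its volumes (the demo later in
        # the talk shows this compute-storage affinity case).
        "storage_affinity": True,
    },
}

def required_keys_present(request):
    """Tiny validation step a scheduler front end might perform."""
    return {"job_type", "resources", "constraints"} <= request.keys()
```

A scheduler API taking such a document could translate each constraint into terms the placement engine understands.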
Once you have the contract, in addition to it, you would like to figure out how you could have a smarter scheduler that can consume these requirements from the NFV layers and do intelligent placement of workloads. And on that particular step, we have made progress. In fact, we've shown that you can do smart scheduling in OpenStack even with the existing Nova, even in Icehouse, and without disrupting the architecture. Not only that, we could do smart scheduling across services, using network, storage, and compute constraints together, and we'll go over some of that. When we presented this at the previous summit, there was a lot of interest; however, the killer app was still not very clear to the community. I feel that now we see a real killer app. Essentially, what we would like to do is have a smart scheduler in OpenStack that will use analytics, possibly big data analytics if your entire system is large-scale and distributed, to determine the current state of the OpenStack deployment. Once you know the current state, we use resource management techniques, like optimization, et cetera, to place resources based on the constraints that the application has given us through the API, which today doesn't exist, but assume that API exists, and on the current system state. So that's our goal. In the scheme of things, what we have done today is the smart scheduler that my colleague Yathi is going to talk about, and for this release we have pushed an API extension in Nova called server groups, which we believe we can expand to encode a bunch of these constraints. So with that, Yathi. Hi, thanks, Debo and Ramki.
So going forward, as a possible solution to address this NFV use case, we propose this; we actually proposed it at the Icehouse design summit, and it's ongoing work. To give you an idea, what we need is a smart resource placement engine which can take in these requests from the top. In terms of tenant APIs, what I mention here as rules and policies means all the NFV consumer requests that come in using a clearly defined API. Those rules and policies, whatever is requested from the top, need to be translated into some sort of constraints, which will be used along with the global state. For this global state, like Debo mentioned, we could use some analytics to derive some of these metrics. In addition, what we need here, which is currently lacking, is a way to do cross-service constraints. We want constraints that include network, storage, compute, energy, et cetera; we want to unify the constraints so that we can come up with an optimal decision. All of these feed into this smart placement engine. And what we propose is that we can use really fast implementations of Apache-licensed, third-party solver libraries so that we can actually compute this optimal placement. To give you more details, what we propose is the solver scheduler. It has an intelligent placement engine. To give you an idea, what we're trying to do is to maximize performance or to minimize all the costs. So we define this as a mathematical problem using the various costs that you want to minimize and the various constraints that come from network, compute, storage, and energy, and we ultimately use this smart engine to come up with a scheduling decision for optimal placement of resources.
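Stripped of the fast third-party solver libraries, the placement problem has this general shape: minimize a cost over VM-to-host assignments subject to constraints. A minimal sketch, where exhaustive search stands in for a real LP/constraint solver and all names are illustrative:

```python
import itertools

def place_vms(num_vms, hosts, cost, constraints):
    """Exhaustively search VM-to-host assignments, returning the
    lowest-cost assignment that satisfies every constraint predicate.

    A real solver scheduler would hand this to an LP/CP solver library
    instead of enumerating; the structure of the problem is the same.
    """
    best, best_cost = None, float("inf")
    for assignment in itertools.product(hosts, repeat=num_vms):
        if all(ok(assignment) for ok in constraints):
            c = cost(assignment)
            if c < best_cost:
                best, best_cost = assignment, c
    return best

# Toy example mirroring the demo: 2 VMs, 3 hosts; volumes live on
# host2/host3, so a storage-affinity constraint forbids host1, and the
# cost function prefers spreading across distinct hosts.
hosts = ["host1", "host2", "host3"]
affinity = lambda a: all(h in ("host2", "host3") for h in a)
spread = lambda a: -len(set(a))  # fewer distinct hosts -> higher cost
print(place_vms(2, hosts, spread, [affinity]))
```

Swapping in different cost functions and predicates (energy, bandwidth, latency) is what "unifying the constraints" amounts to in this framing.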
And we can use the various energy profiles, server states, network link capacities, and system capacities as inputs that feed into this computation. So this is our brief idea, and it is very complex. This is an example of how the mathematical problem looks; I don't want to bore you and get into the details, but just to give you an idea, this is what we mean by smart placement. So in the next two minutes, I'm going to show you a quick demo. We use the scenario of compute-storage affinity as one of the requirements for making optimal placements. Mapping it to NFV use cases, this could be very applicable for backup services or seeding services, where you want to place the service VM close to the storage. In this particular setup, imagine there are three hosts, and two of the hosts in one of the racks have these volumes installed, demo volume one and demo volume two. And I'm placing a compute request here: I want to create two VMs, and I'm asking that the two VMs be placed close to these volumes. That is what I'm going to show, with the help of a small demo video. What I have here is an OpenStack DevStack installation with three hypervisors: host one, host two, host three. This is the set of servers where I can instantiate VMs, and I need two VMs. The volumes, demo volume one and demo volume two, are located on host two and host three, as you can see. So what I want to see is: can I place a request for two VMs which will ultimately get placed close to, or exactly on, those hosts, so that the compute and storage are really near each other? That is the constraint that I feed into my engine when making the request.
So this is a working implementation of the solver scheduler integrated with Nova, without many changes to the architecture. Here I'm making a nova boot request for two VMs. And as you can see, the VMs get placed on those two hosts: the two instances created are actually on the hosts with the volumes, as I had requested in the nova boot request. So yeah, like I said, this is just an example of how we can specify a constraint to our solver scheduler. Compute-storage affinity is just one example, but we can use energy and others; all of this is ongoing work in our team, and we are exploring these options. This is just to give you a flavor of how a smart scheduler can unify all the constraints across compute, storage, volume, energy, and others to give one complete set of optimal decisions. To conclude, what I would like to say is that NFV is really a killer use case for OpenStack. What I would like to request is this: we have proposed a blueprint for the solver scheduler, and I would like to use this NFV use case as a way to work with the community and move it forward. As a matter of fact, like Debo mentioned, we have already pushed code for the server group API, which is a first step towards defining APIs that let you make these complex requests, with all these constraints and policies defined, so that you can come up with a group of servers that you can schedule together. In addition, we should also look at the Neutron hooks which will give us the network-related constraints for making an optimal decision. I would like to conclude here and open up for questions. Can you come to the microphone? Yes, hello. So this constrained optimization, does it have an exposed interface so people can tune it or write their own? Yeah, it is there.
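The server group API mentioned above is reachable through scheduler hints on the boot call. As a rough sketch of what such a request body looks like on the wire (the UUID values are placeholders, and only the relevant keys are shown, not a complete request):

```python
# Sketch of a Nova boot request body that places a new instance into an
# existing server group via the "os:scheduler_hints" extension. The
# group's placement policy (e.g. affinity or anti-affinity) was chosen
# when the server group itself was created.
boot_request = {
    "server": {
        "name": "backup-vm-1",
        "imageRef": "IMAGE_UUID",    # placeholder
        "flavorRef": "FLAVOR_UUID",  # placeholder
    },
    "os:scheduler_hints": {"group": "SERVER_GROUP_UUID"},  # placeholder
}
```

Expanding what can be expressed in that hints section is one natural path toward the richer constraint API the speakers are asking for.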
So you can actually tune it: feed in constraints, feed in costs; you can feed in your own set of constraints to ultimately solve the problem. It all works inside the Nova scheduler; Nova doesn't realize that we've actually shipped the decision-making to the solver. Nova just thinks that it's the same old scheduler. Right. This is Keshav from HP. What are the NFV entities that you are targeting? Because there are carrier-grade NFV entities like NAT, or VLR, HLR, which are on dedicated hardware, whereas when you come to OpenStack, do you still sit outside and try to communicate to OpenStack in some other way, or will you sit inside OpenStack as a VM? No, so we assume that the NFV business logic is on top of OpenStack. The business logic encodes the constraints and makes an OpenStack API call. The API call then translates those into scheduler constraints, or a scheduler API. Okay, so you are not going to fit inside the OpenStack framework? So we are, I mean, the scheduler... I mean, currently, the kind of HA requirements or the service-upgrade kind of requirements that carrier-grade will have. So I agree, maybe we should discuss this offline, but for a large class of constraints posed by the NFV use cases, we believe that we can encode them into optimization constraints, then solve and place optimal resources. But there may be one or two edge constraints. Yeah, I think that's exactly it: we are building constraint-based optimization, rather a framework for constraint-based optimization, with NFV as a use case, right? Right, okay. And it's very flexible. It's flexible, you can do whatever. But let's keep it open; I think we can discuss offline as well. Yeah, hi. Great presentation. I have a question about the network optimization part, because it seems to me that a lot of the time when you place these network functions, what you care about is the relationship between these functions.
Say I want, for example, at most five milliseconds between these two NFV VMs that are sitting in different places. Telcos come from the point of view that they sometimes actually want guaranteed performance between a set of VMs: east-west in the data center. So what I want to ask you is this. First, can you explain a little more how you handle the network optimization part? Where exactly are you going to get the information from? And second, do you really think you can have a solution without doing any measurements? Because to me, it seems like you have to figure out the rack-to-rack or host-to-host bandwidth. Bandwidth? Not just bandwidth, actually, it's more than that, right? Because you can make MPLS reservations and then oversubscribe them and have some more traffic. So there needs to be some sort of feedback loop on what the real performance actually is. Actually, that's a great question. I'll break the answer into two parts. One is the network optimization. When you have dependencies between virtual network functions, you can represent that as a workload topology, a graph, or constraints. From the graph, we can generate these constraints, and therefore we have been able to incorporate dependent virtual request topologies. Now, to address your concern about measurement: what we've done until now is use static network distances. See, our solver is independent of the domain of how you encode the network distance; we just assume that you give us a network distance matrix, without going into details. The way that Neutron could really make this a reality is if Neutron had a network distance API by which we could get this distance matrix; we would be able to feed it into the scheduler, but that doesn't exist today.
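The static network-distance workaround can be pictured like this. A hypothetical sketch: the matrix values and the cost function are made up, but this is the kind of input a Neutron network-distance API could one day supply to the scheduler.

```python
# Static pairwise network distances between hosts (hop count, or a
# normalized latency) -- supplied by hand today, since no Neutron
# network-distance API exists to provide it.
hosts = ["host1", "host2", "host3"]
_raw = {("host1", "host2"): 2, ("host1", "host3"): 4, ("host2", "host3"): 1}
distance = {(h, h): 0 for h in hosts}
for (a, b), d in _raw.items():  # make the matrix symmetric
    distance[(a, b)] = distance[(b, a)] = d

def topology_cost(assignment, demands):
    """Network cost of placing VMs (indexed 0..n-1) on the given hosts,
    weighting each pairwise traffic demand by the host distance."""
    return sum(traffic * distance[(assignment[i], assignment[j])]
               for (i, j), traffic in demands.items())

# Two chatty VMs: using nearby hosts lowers the cost.
demands = {(0, 1): 10}
print(topology_cost(("host2", "host3"), demands))  # 10 * 1 = 10
print(topology_cost(("host1", "host3"), demands))  # 10 * 4 = 40
```

Replacing the static matrix with measured values is exactly the feedback loop the questioner is asking about; the solver itself would not change.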
So we had to do the static thing as a hack to make our demo work. Yeah, in general, any set of metrics, whether real-time dynamic metrics or not, can be used, as long as they're fed into the set of rules which we use while solving the constraint optimization problem. Ultimately, this is a framework that can take in any set of inputs; it's a mathematical solver. That is the idea, yeah. But how do you achieve that? We need functions that provide us those metrics, and those can be fed into the framework. Just to add, I think that's specifically why we took a use case which is more delay-tolerant. The example I explained is more of a backup, where those services are more elastic and you have more flexibility in placing your VMs and storage across data centers. Those are the easier ones to tackle, versus the hard cases where you have latency or jitter constraints. So you have to break it down on a use-case basis and then start addressing these problems. But our constraints do model it. Our constraints themselves, as Yathi and Debo explained, are extremely flexible: latency could be a constraint, anything could be a constraint. Gentlemen, commendable job, and I congratulate you for at least picking up one piece out of the number of open questions for NFV there. You have looked at the scheduler as one part and enhanced it. But there is control plane/data plane separation, then performance at the network layer, performance at the storage layer, then latency and QoS; there is plenty of that. Given all that, do you not think that we need a broader, top-down consideration, an NFV API, or an API on OpenStack for NFV per se, so that you can address it piece by piece? And do you think that you will work with other organizations to do that? So of course, I think it's a very big problem that cannot be solved by a small group. Yeah, go ahead.
So independent of what API is designed at the higher NFV layer, the problems that we are trying to tackle will not go away. We wanted to tackle some of the core, very fundamental building blocks that will be unchanged, because the discussion of the NFV API layer is a huge one, and I personally believe that we are not going to have any convergence soon. No, there is no need for convergence; it can be de facto. I don't believe that there will be convergence in NFV soon, so we are happy to work with any group of people. That's why I ask: do you want to work with other people to get there faster? In fact, just to highlight, we had actually talked about it in this slide: the application information, that's the open API. So yes, it's a very broad context and there is a lot to do around getting application information, but we zoomed in on specific aspects. Yeah. And exactly, also on the project side, we are open to working with others, even competitors, and we will make it happen. Absolutely, yeah. It's completely open. And one blueprint is just one blueprint; we need ten blueprints to make that happen. Also, to answer your question, it's about augmenting the existing server APIs with cross-service APIs; that's exactly the proposal. For example, we have to remember that all servers are not equal in energy efficiency, they may not be exactly the same, and that's one of the things we'll model. So it looks like there are two parts to your presentation. One is the API definition, being able to create the hooks so that you can do what you want to do. The other part, which I think is also massive, is the analytics that you can build based on the data you collect. Right. I just wanted to get an idea: are you more focused on the APIs right now, or more on the analytics? And then there's a follow-up question, if you allow.
Yeah, so as a group, we are right now not focused on the API per se, because we are hoping that the community will jump in and figure out what the right API is. We wanted to show the community, using a specific, very simplified version of the API. So we pushed in the server group API, we've done some simple scheduling, and we've pushed code for review for an entire framework and shown some simple examples. And at this summit, we've given three talks on how you use analytics to figure out system states and use that in the context of visualization. We believe that these are small pieces of the same big puzzle, but we haven't come up with the big, broad API; we feel that that's a community process. Okay, so let me ask the question on the analytics. I think Yathi passed quickly over one of the foils where you had a cost function or performance function. Yeah, that is the one I mentioned. Have you done a case study where you can share, in just a quick 30 seconds, some of the accomplishments? Like, if you use the cost function, what sort of case study might you have done, with proven results for the optimization? Yeah, so we can take this offline and show you the exact demo videos and slides; we've had three talks at the summit. But to abstract what you just asked: yes, we can actually use analytics to find hotspots in real time. So imagine you can find hotspots in real time; you can then use that as input, a traffic matrix, to the solver scheduler, which can be used to figure out where your resources should be placed. For example, if you have a data center with, say, five racks, and there is a Hadoop job running in rack three, there is a hotspot which we can identify independently right now with the analytics. We can then add a scheduling constraint that will avoid rack three. So in the presentation you had some numbers; I just wanted to see, you know, what...
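The hotspot example can be expressed directly as one more constraint predicate for the solver. A hypothetical sketch: the rack-load numbers would come from the analytics pipeline described above, and all names here are illustrative.

```python
# Measured rack utilization, as real-time analytics might report it;
# rack3 is hot because of the Hadoop job in the example.
rack_load = {"rack1": 0.30, "rack2": 0.45, "rack3": 0.95}
host_rack = {"hostA": "rack1", "hostB": "rack2", "hostC": "rack3"}

def avoid_hotspots(assignment, threshold=0.80):
    """Constraint predicate: reject any placement that puts a VM on a
    rack whose measured load exceeds the threshold."""
    return all(rack_load[host_rack[h]] <= threshold for h in assignment)

print(avoid_hotspots(("hostA", "hostB")))  # True  -> placement allowed
print(avoid_hotspots(("hostA", "hostC")))  # False -> rack3 is a hotspot
```

Such a predicate would simply be appended to the constraint list the placement engine already evaluates, which is the "small pieces of the same big puzzle" point.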
Yeah, we can do that offline. We have a lot of data that we can share. Sounds good, thank you. Yeah, any other questions? Okay, thank you. Thank you.