I'll try this again. There we go. Welcome. We're going to spend a few minutes talking about the data plane on OpenStack. We're going to look at strategies and tactics for deploying a highly available data plane on OpenStack. My name is Hank Green, and Gautam Divgi and I will be presenting together. We work for AT&T. We've split the presentation into two parts: I'll be taking a look at the platform needed to deploy a highly available data plane, and Gautam will be exploring VNF strategies and tactics for architecting a highly available data plane.

Platform features. What is the platform? We're obviously at an OpenStack Summit, so OpenStack is the platform. If we wanted to do automated placement, Heat is a tool. But is Heat the only tool, or the tool that's going to help us achieve what we want to do? What about setting up control-loop fault detection? How are we going to do that? We need that to deploy a highly available data plane. We'll explore auto recovery, two scenarios there, and we'll quickly take a look at SR-IOV.

To thread this conversation together, we're going to use an imaginary company, VoIP VNF Incorporated, and we'll walk through all the steps to deploying a highly available data plane with a VoIP VNF. But first, we have to build an infrastructure with a highly available data plane and put in the components that make it highly available. If you look on the left-hand side, you'll see a single data center, a single OpenStack region. If you look at that all the way from the hardware up, mechanical, electrical, racks, placement of routers and switches, the best you should plan for is three-nines availability. So that doesn't get us the data plane we want. Step all the way over to column five, and we deploy four regions, stitched together at layer 1. These are geographically separated regions.
At layer 2, MPLS over UDP or MPLS over GRE, and we have a network stitched together, and we can use things like floating IPs, the standard stuff, HA proxies, load balancing, availability zones, those kinds of items, to deploy a highly available data plane. If you're familiar with the resiliency studies, four regions may seem a bit extreme, but there's a scenario you have to take into account: planned downtime. There may be moments where you need to take a region out to do upgrades. So let's say your region's out for a day for an upgrade. With that in place, you have a five-nines-available data plane. One other tactic for a highly available data plane is dual LCPs, as you see in column three. Same reason: upgrades.

Now let's expand the picture a little bit more. In column five, you see the four regions. Let's multiply that by 20. So we have 80 regions, 75 or so in the United States, and sprinkle the rest around the world. That's a very large data plane. How are we going to do automated deployments? How are we going to set up our control-loop fault detection and do auto recovery? A few other presentations this week offer the answer, and the answer is the Open Network Automation Platform, ONAP. So we're going to walk through the components of ONAP that do an automated deployment.

Now, there's one thing you need for an automated deployment: you have to know what's out there. So on the right side, you see a component called Active and Available Inventory that has all the information about all the regions out there, hosts, networking, storage, so that when you deploy your VNF, you know where you're going to put it. That's component one. The other component of ONAP is Service Design and Creation. So now we bring in our imaginary company that has provided us with a VNF, a VoIP VNF, and we upload the VNF. An interesting thing to note about the VNF: the licensing characteristics follow along with it.
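As an aside on the availability arithmetic above: the four-region, three-nines-per-region claim is easy to check. A minimal sketch, assuming independent region failures (which real geo-redundancy only approximates):

```python
# Composite availability of N independent redundant regions,
# each individually at "three nines" (99.9%).
def composite_availability(per_region, regions):
    return 1 - (1 - per_region) ** regions

def downtime_minutes_per_year(availability):
    return (1 - availability) * 365.25 * 24 * 60

# Four regions at three nines each:
print(composite_availability(0.999, 4))              # ~twelve nines
# Even with one region down for a planned upgrade, three remain:
print(composite_availability(0.999, 3))              # ~nine nines
# The five-nines target allows roughly 5.26 minutes of downtime a year:
print(round(downtime_minutes_per_year(0.99999), 2))  # 5.26
```

Even with one region out for a planned upgrade, the three remaining regions give roughly nine nines, comfortably above the five-nines target.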
So we upload that into the Active and Available Inventory. Now we go through the design and creation of our VNF, and we configure it with the performance characteristics so that when certain thresholds are crossed, or if events are occurring where our VNF is degrading, actions will be taken. Once the VNF is finished being created in the design and creation phase, a Heat template gets passed off to the Execute and Manage component. So we'll dive in there and take a look at how the components are involved in deploying the VNF.

OK, so box one, the Active and Available Inventory, we've already discussed. Box two, MSO, the Master Service Orchestrator, is the component that does the automated deployment. Components three and four are the items that begin to put our control loop together. Item three is the DCAE component. We love our acronyms, don't we? The Data Collection, Analytics and Events component. It collects all the information: Nagios, Ceilometer, SNMP traps, syslogs. I think there's something else in there that I might be missing. But information about our VNF flows into the DCAE component, and then, based on the triggers that we set in our service descriptor, events will occur. But we need to close the loop, and when the VNF gets deployed, controllers are put in place which will take actions as needed for the VNF. We'll see how that goes in a second. One other thing about the DCAE component is that it also helps you with your capacity planning. You can see when your regions are reaching certain limits, and then you can decide how you want to expand your environment.

So let's dive into the Master Service Orchestrator. Our VNF has been described in a Heat template. It gets passed to the Master Service Orchestrator, item number two, which goes off and does the automated deployment. So now our VNF is deployed in the environment. Then network controllers, infrastructure controllers, and application controllers get deployed, and we can now control our VNF.
Information is then fed through the DCAE component. Oh, and another point: the Active and Available Inventory is updated so that we have accurate information about what's happening in our environment. We now have a closed-loop scenario, and we can take actions based on how the VNF has been deployed. But let's explore the automated deployment a little deeper and look at some options that are available coming up. And by the way, the VNF is now deployed. We've achieved our automated deployment.

So some of you in this audience may have heard of a project called Valet. It's in the technical evaluation process right now in the OpenStack community, and we're excited about it. Some bullet points on Valet. It's a Heat-level scheduler, so it ties into Heat as it is now. The key thing, the purpose of Valet, is this, and a valet is a great analogy for it. Just think of a parking facility: cars are coming in, and the valets are placing the cars. Say a Fiat comes in, but the valet parks it in a spot sized for a Hummer. What we're really looking for is to optimize the placement of our VNFs. We don't really want to put a Fiat into a large parking space when we're going to have a larger VNF come down the road. So the idea is that the Heat template has a holistic view of your application, Valet takes a holistic view of your region, and then it will optimize the placement. Valet has some extensions to it: Valet Pipe looks at your networking characteristics, and the resource group combines the placement of your VMs. A very interesting article that describes the optimization process is listed here, on Ostro. You can Google for the document, and it is a fascinating place to look at how the algorithm works. Valet, since we're looking for high availability, is also highly available itself. It uses Cassandra, ZooKeeper, and HAProxy, and it's designed to avoid the split-brain scenario.
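To make the parking analogy concrete, here is a toy best-fit heuristic in the spirit of what a holistic scheduler does. The real Valet/Ostro engine also weighs network proximity, affinity and anti-affinity rules, and storage, so treat the function below as an illustration, not Valet's actual algorithm:

```python
# Illustrative best-fit placement: pick the host whose free capacity most
# tightly fits the request, leaving the big "parking spots" open for
# larger VNFs that may arrive later.
def best_fit_host(hosts, vcpus_needed):
    """hosts maps host name -> free vCPUs; returns the tightest fit or None."""
    candidates = {h: free for h, free in hosts.items() if free >= vcpus_needed}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

hosts = {"host-a": 64, "host-b": 8, "host-c": 16}
# A small "Fiat" VM (4 vCPUs) lands on the tightest fit, host-b,
# keeping host-a free for a future "Hummer"-sized VNF.
print(best_fit_host(hosts, 4))  # host-b
```

A naive first-fit or spread scheduler would happily burn host-a's 64 free vCPUs on the small VM; best-fit is one simple way to avoid that.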
So we have highly available placement of our VNF using Valet, in an optimized manner. That way we're using the resources in our regions as best we can. Let's dive a little deeper and take a look at how Valet works. First we're going to focus on the green boxes. Your Heat template comes into the green box. Currently, your VM would get placed by Nova, and then it would go off and grab Cinder, and it might not necessarily do it the way your application needs. When we plug Valet in, the plug-in gets called at the point of the Valet create call, and then the Valet API service sends it to the optimizer. It does a sanity check on the Heat template, and now the optimization kicks in: the optimizer compares the Nova, Cinder, and networking components and finds the best places for what you're trying to deploy. It grabs the IDs, reserves them, and passes them back up to the Heat template, and Heat goes off and does your automated deployment. Again, this is a project that's in the OpenStack community under evaluation right now, and I think it's an exciting place to go take a look at some of the detailed components of OpenStack and how it all works together.

So we've got our VNF deployed. Now we want to look at some of the control-loop scenarios. We have our VoIP VNF. It's deployed in the Boston region. Amazingly enough, there's a conference: there are a lot of people coming in, a lot of people making phone calls, and we're reaching a certain threshold. So VES, the VNF event stream, sends information back up to the DCAE component. The policy engine kicks in and says, OK, we're reaching a threshold; what we want to do now is deploy another VNF. The Master Service Orchestrator gets called, the app controller gets called, and another VNF is deployed. We have a highly available data plane. This is the cloud-aware behavior we're trying to achieve. So the conference ends, everybody goes home.
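The threshold-triggered scale-out just described amounts to a simple policy; a real deployment also needs the symmetric scale-in check for when load drops again. A sketch, where the class and method names are illustrative rather than ONAP policy-engine APIs:

```python
# Toy closed-loop scaling policy in the spirit of the DCAE -> policy -> MSO
# flow: compare per-instance load against thresholds and decide an action.
class ScalingPolicy:
    def __init__(self, high, low, min_instances=1):
        self.high = high                  # calls per instance that trigger scale-out
        self.low = low                    # calls per instance that allow scale-in
        self.min_instances = min_instances

    def decide(self, concurrent_calls, instances):
        load = concurrent_calls / instances
        if load > self.high:
            return "scale_out"
        if load < self.low and instances > self.min_instances:
            return "scale_in"             # conference is over, stop paying licenses
        return "hold"

policy = ScalingPolicy(high=800, low=200)
print(policy.decide(concurrent_calls=2400, instances=2))  # scale_out
print(policy.decide(concurrent_calls=150, instances=2))   # scale_in
```

In the ONAP picture, the "scale_out" decision is what drives the call to the Master Service Orchestrator and the app controller.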
We don't want to be paying for the licensing on our VNF, so we want to wrap it up, and we reduce our VNFs. So that's a quick look at a first closed-loop, or control-loop, scenario.

I'm going to slide very quickly over SR-IOV, kind of as an introduction to why you need it. OK, so when you're talking about voice calls, you have some very stringent performance requirements. The packets need to get in and out fast so that the conversation is a pleasant conversation. To achieve that, when you see the purple VM, our VoIP VM, it's tied directly to the NIC, a PCI NIC. The PCI NIC is divided into two components: there's a physical function and a virtual function, and the virtual function gets configured to your VoIP VNF. You now get near-wire-line performance. However, there's a challenge with SR-IOV, and that is that live migrations are a little more complicated, if not impossible.

So let's take a look at an auto-recovery scenario, in case an error is occurring and we want to recover from it and keep our highly available data plane. OK, so the red box is our VoIP VNF that's going into a degraded mode. Information is being sent to the DCAE component. It realizes the situation is occurring and we need to correct it. This auto-recovery scenario is called make-before-break, by the way. What happens is it recognizes there's a situation occurring, we deploy a new VNF, and then, and I'm sliding over a fair amount of technical detail here, the state data of the degrading VNF is copied over into the newly instantiated VNF, and the call is then rerouted to the new VNF. This has been done in the lab environment. This is a little bit of mixing apples and oranges, but the times they've seen from the moment you cut the last packet into the degrading VNF to the first packet seen in the newly instantiated VNF are in the range of 100 milliseconds.
Barely noticeable, probably not noticeable, in a phone call, and we keep our data plane up and running. So we've done an automated deployment with ONAP. That's exciting. We've taken a look at two control-loop scenarios and a quick look at SR-IOV, and we're excited about Valet going through the technical evaluation process and about ONAP as we've shown it here. I'll now turn it over to Gautam, who will explore some strategies and tactics for designing your VNFs to keep them highly available as well.

Thank you, Hank. So what I'll look at is VNF resiliency. We'll start off with some very brief concepts on what we'd like for a five-nines-available VNF, and then we'll dive into a use case and explain the concepts over that use case. Essentially, if you look at it conceptually, solving resiliency is a four-pronged effort. From a VNF resiliency perspective, we need some requirements. The second part is that those requirements may differ based on the type of VNF you have. Then there are platform features, several of which Hank has already described, and then we'll go in and look at some guidelines called resiliency dimensions. These are essentially concepts, so I'm going to go very quickly through them and get to the use case, because that's really the more interesting part.

So let's look at it from a requirements perspective. These are fairly well known. We want service continuity. We want to avoid single points of failure. We want redundancy awareness: how is the VNF deployed? Active-DR, active-passive, active-active? The best is active-active, but it's very hard to do. We want effective failure management. We want good failure recovery; the big things in failure recovery are correctness of recovery, the make-before-break that Hank was talking about, and the latency of the recovery: how fast can you switch over? Correlated failure.
So a lot of times your VNF is part of a service; it's not individually deployed. And when you chain a service, you're really only as good as the weakest link in your chain. Last but not least is cloud awareness. When you have a VNF deployed, your VNF needs to be aware of how it is deployed in the cloud, and most importantly, it needs to know the health of your cloud. It's not separate from the cloud; it is in the cloud, so it needs to know the cloud's health.

OK, so we're categorizing VNFs into three broad types, and the reason for this is going to become clear pretty soon. The first type is: at what layer does your VNF operate? The reason you want to know that is that a VNF at layer N cannot use a platform feature that is at layer N plus one. For example, a vCE cannot use DNS-based redundancy: the vCE operates at layer three, and DNS is at layer four. Second: how is your VNF managing its state? Your easiest VNFs are the ones that are stateless. For example, DNS is stateless: your request could go to any one of the redundant instances and you'd get a response back. A lot of times your VNF maintains significant state, it caches state, there may be external state; that introduces a lot of complexity with respect to state replication and data consistency. Third: how closely is your VNF associated with the physical layer? Using SR-IOV is pretty closely associated with the physical layer, and the closer you are to the physical layer, the harder it becomes to take advantage of cloud abstractions that can manage resiliency for you.

Resiliency dimensions are fairly high-level guidelines. There's a lot more detail about these in the ONAP cloud-readiness guidelines for VNFs. To give one example here, I can just go to the last one, monitoring and dashboards: essentially, do you have the right things monitored for the health of the VNF?
And this goes into: do you have the right metrics that you're capturing, or are you just measuring a simple heartbeat? That makes a big difference when you look at the health of the VNF.

OK, so here's my best practice for presenting a best practice: let's go through an actual rainy-day scenario, see what's happened, and sketch out a solution outline. Our sample VNF is the voice VNF that Hank was talking about earlier. It is at layer four, and it is stateful. So, first thing: I'm at the conference, I'm trying to call Hank. He sounds, not like Hank; he sounds like Darth Vader. Hank's still a good guy, I've worked with him for a while, so I know he hasn't moved over to the dark side. It's got to be my VNF; that's probably on the dark side now.

Cloud failures. So, remember, we want a five-nines-available VNF. That means I have 5.26 minutes of unplanned downtime a year. My cloud is at three nines availability, which works out to roughly eight and three-quarter hours of unplanned downtime a year. And therein lies the big difference. What I'm seeing is that even though I have a VNF that is supposedly active-passive, a cloud failure is making it miss its availability requirements. What I'm also seeing is some redundant-instance crashes. I know the VNF is deployed active-active locally; it has redundant instances in the same zone, but sometimes instances just crash. I'm losing voice streams, which results in dropped calls and all the bad experience that goes with that.

So, to skip all the investigation details that go into this for several days or weeks, what we find out is that the reason Hank is Darth Vader is that we're not monitoring for packet loss. Voice VNFs are very susceptible to packet loss. You don't monitor for packet loss, you're going to sound like Darth. High-impact latency.
So, although we're active-passive, our DevOps process has a very high latency to switch over from active to passive. This latency for failure recovery impacts how we shift our VNF from the cloud zone that has failed into the cloud zone it needs to go to. And why are redundant instances crashing? This is an example of correlated failures. I have a load balancer in front of my instances, and it is sending them unbalanced load, which is why some instances are more overloaded than others, and they crash.

So, let's look at some solutions here. One of the main points is that I'm also going to point out some fairly glaring gaps. The availability piece is the simplest, because we can use platform features like global load balancing, set up effective heartbeats, and make sure there is an automated way to shift over from your active to your passive instances.

Let's look at packet loss. This ends up becoming very interesting. As a first step, let's say that as a developer, I monitor for packet loss and set up some sort of Heat orchestration that will recreate my instance when I see packet loss on that VM. The problem is: are we recreating the instance correctly? In the sense that, when I recreate the instance, am I placing it on a server that has better packet-loss characteristics than the one I already have? That goes into a lot of interaction with how your orchestration mechanisms need to be more dynamic in their placement algorithms. To take Hank's analogy about placing the Fiat and the Hummer in a parking space: even though you may find a spot for the Fiat, if someone has left a shopping cart in that spot, the Fiat shouldn't be put in the spot with the shopping cart. It needs to avoid that spot.
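A sketch of the shopping-cart idea: a watchdog that only recreates the instance onto a host whose recent packet-loss history is clean, and escalates when no clean host exists. All names and thresholds here are hypothetical:

```python
# Illustrative watchdog: recreate a VM when packet loss crosses a threshold,
# but only onto hosts whose recent loss history is clean -- i.e. don't park
# the Fiat in the spot with the shopping cart.
PACKET_LOSS_THRESHOLD = 0.01  # 1% loss is already audible on a voice call

def pick_healthy_host(host_loss_history):
    """Return the host with the cleanest recent loss history, or None."""
    clean = {h: loss for h, loss in host_loss_history.items()
             if loss < PACKET_LOSS_THRESHOLD}
    return min(clean, key=clean.get) if clean else None

def remediate(vm_loss, host_loss_history):
    if vm_loss < PACKET_LOSS_THRESHOLD:
        return "healthy"
    target = pick_healthy_host(host_loss_history)
    if target is None:
        # Every host in the zone is lossy: a local recreate would just
        # thrash. Escalate so a global controller can fail the zone over.
        return "escalate_to_global"
    return f"recreate_on:{target}"

history = {"host-a": 0.002, "host-b": 0.04, "host-c": 0.0005}
print(remediate(vm_loss=0.03, host_loss_history=history))        # recreate_on:host-c
print(remediate(vm_loss=0.03, host_loss_history={"host-b": 0.04}))  # escalate_to_global
```

The "escalate_to_global" branch is the key design point: local orchestration needs a way to say "this zone cannot help me anymore."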
Now, if you look at it in further detail, let's say you have a problem in your underlay that is causing your packet loss. What do you do then? If you just have local orchestration, it's not going to help you, because your orchestration mechanism is just going to keep thrashing about, trying to recreate VMs on different machines, and it's not going to work. That's where you need a global mechanism like ONAP to step in. ONAP needs to be aware of your packet-loss metrics, and it needs to be aware of thresholds that can declare a zone dead when reliability metrics fall below a certain point. That's when ONAP, using your global load-balancing mechanism, should be able to shift traffic over from the active to the passive zone; your active zone may not be dead, but for all reliability purposes, that zone is not operable.

Again, the correlated-failure problem was because of an incorrect load-balancing mechanism at the load balancer. We have long sessions; we need to use least-connections load balancing as opposed to round-robin. Because of that misconfiguration, we had issues with instances crashing. The problem is not so much the misconfiguration, but the fact that I don't have metrics today that can tell me what sort of concurrent sessions are coming into a VM and what the flow rate is. If I had known that, and I had tracked it, I would have immediately known there was unbalanced load causing the crashes.

A big gap here as well, if I go one slide back, is testing this mechanism. I will talk more about testing such things in a moment. But to say the least, when we test for failure, we really need to look at inducing packet loss at various points and making sure that the failure mechanism works when that packet loss has been induced. Inducing packet loss at places like the underlay is harder. I think that's a big gap.
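As an aside on the load-balancing point above: with long-lived voice sessions, round-robin counts requests while least-connections tracks live load. A minimal sketch of the difference (backend names and session counts are made up):

```python
# Round-robin hands out backends in rotation, blind to current load;
# least-connections picks the backend with the fewest live sessions.
import itertools

class RoundRobin:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)
    def pick(self, live_sessions):
        return next(self._cycle)  # ignores current load entirely

class LeastConnections:
    def __init__(self, backends):
        self.backends = backends
    def pick(self, live_sessions):
        return min(self.backends, key=lambda b: live_sessions[b])

live = {"vm-1": 950, "vm-2": 120}  # vm-1 is near its session limit
print(LeastConnections(["vm-1", "vm-2"]).pick(live))  # vm-2
print(RoundRobin(["vm-1", "vm-2"]).pick(live))        # vm-1, despite the load
```

With long sessions, round-robin will keep handing new calls to vm-1 half the time even when it is saturated, which is exactly the overload-and-crash pattern in the scenario.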
But we should be able to simulate it approximately by inducing packet loss at all the VMs on your tenant, and that should allow for some testing. Still, it's a gap that today we cannot effectively simulate something like underlay packet loss.

Before I move further into the testing realm, I'd like to point out some desired improvements from our side for VNFs. The first is a more modular, more open, API-based VNF, as opposed to monolithic and closed. Second, you really would like VNFs to be designed active-active. Given the performance constraints that VNFs are under, an active-active VNF is a very hard problem, but we'd really like to get there. Third is using standard platform features and well-known software libraries and images, as opposed to proprietary images and so on and so forth. And last but not least is effective resiliency and performance testing of the VNFs: essentially, what happens when you break VNF components under load?

That brings me to my statement here: for software and hardware, just about everything is going to fail. What you can do is really just expect the failure and design around it. Let's go back to our load balancer scenario. We have an active-passive load balancer, because that's how load balancers work. You've got a floating IP. There's a heartbeat between those load balancers. You've got traffic into a VNF, and let's say the VNF has external storage, and there are heartbeats between the load balancer and your VNF. There are several points of failure you can point out here. The first is: does your heartbeat equate to health? Is your heartbeat just doing a layer-4 connect, or is it actually running a synthetic transaction right down to the storage?
Because if it's not doing a synthetic transaction right down to the storage, right down the chain, your heartbeat isn't really testing the health of the VNF. Again, how are you detecting congestion at various points: CPU congestion, network congestion, congestion in storage? Pulse loads are another topic. If, for example, your active load balancer fails, is there an intelligent mechanism to transfer load to the passive? If you are at a very high load, that pulse of transferred load can actually kill your passive, so this is important as well. There are various other failures that can happen: the load balancer process inside the VM can fail, you have network congestion, what happens when you lose a rack, what happens when you lose a server? So a lot of the point of resiliency testing is really this: we know we have some designs to overcome failure. We need to make sure that they work, that they work under the scenarios they were designed for, and that our mechanisms to remediate and mitigate those scenarios are verified effectively.

The main thing is that resiliency testing is really hard, in the sense that it's non-deterministic: you cannot simply say that a specific test has worked or not worked. You will be forced to define measures of success, essentially KPIs: things like throughput, flow rate, how long it took to detect, how long it took to recover. And you have to test over a long, long running cycle. Simulation of failure events is another big thing. For example, in the example up there, gracefully stopping a service and then checking whether your failure mechanism works isn't really a resiliency test. Killing the service, yes, that I would say is a resiliency test. You have to kill the service. Even so, there is another example I can give with killing. Let's say I have a VNF and I do a kill -9.
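Stepping back for a moment to the heartbeat question above, the difference between a layer-4 connect and a synthetic transaction down the chain can be sketched like this; the write/read callables stand in for real service and storage calls:

```python
# A layer-4 heartbeat versus a synthetic transaction that exercises the
# whole chain (load balancer, application, storage) in one check.
import socket

def l4_heartbeat(host, port, timeout=1.0):
    """Only proves the listener is up -- not that the VNF is healthy."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def synthetic_transaction(write, read):
    """Write a probe record through the service and read it back."""
    probe = "health-probe-token"
    try:
        write("probe-key", probe)
        return read("probe-key") == probe
    except Exception:
        return False

# In-memory stand-ins for the real service calls:
store = {}
print(synthetic_transaction(store.__setitem__, store.get))  # True
```

If storage is down, the layer-4 connect still succeeds while the synthetic transaction fails, which is the gap the talk is warning about.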
And I see that my failover mechanism works perfectly, and I'm going to say, good. But what happens if you do a kill -3? We know that when you do a kill -3, it's going to core dump. Imagine you have a VNF that is caching a lot of internal state: your core dump is probably several gigabytes in size. So unless you set your ulimits properly and set the core size to zero, it is going to write several gigabytes of data to your disk. Those are the kinds of degraded conditions that you want to trap for with resiliency testing.

So in short, and in closing, I'd like to say that the design we use to achieve the five nines involves very extensive monitoring, orchestrated through well-known platform tools like ONAP and Valet and so on. The big gap we have now is testing those designs, both for the VNF and for the platform. And that's essentially a call-out for the LCOO, the Large Contributing OpenStack Operators. We do have an extreme testing initiative within the LCOO, and I would like to call that out and request everybody to participate in that initiative. And with that, thank you, and we can take questions now. If you can step to the microphone.

Bill Welch, coming from Sonus Networks. From an application, you know, VNF provider perspective, where is the biggest weakness that you're currently seeing in the applications you're deploying regarding all of this?

So what I'd say is, what we've seen is that the VNFs are, I'd say, virtual clones of their physical deployments. A lot of times, you see VNFs that require just a single VM with a huge number of cores. The VNF internally may be modularized, but when it's deployed, it's not as modularized as we'd like. You really need to think microservices from a VNF perspective, because when you go into the cloud, you're not a network element anymore; you're a distributed system.
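Circling back to the kill -3 example from a moment ago: SIGQUIT (signal 3) triggers a core dump by default, and a stateful VNF's core can be gigabytes, so the guard is a zero core ulimit. Shown here with Python's resource module on a Unix host, as a stand-in for `ulimit -c 0` in the instance image:

```python
# Disable core files entirely (soft and hard limit both zero), the
# "set no core on your ulimits" guard from the talk.
import resource
import signal

resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print(soft, hard)  # 0 0

# The two signals from the example: SIGKILL is 9, SIGQUIT is 3,
# and SIGQUIT's default disposition includes a core dump.
print(int(signal.SIGKILL), int(signal.SIGQUIT))  # 9 3
```

With the limit at zero, a kill -3 still terminates the process, but it no longer floods the disk with a multi-gigabyte core mid-failure.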
And I think that's the mindset shift that probably should happen. So I think probably everybody's made that point this week along the way. So a follow-on question: besides the microservices aspect, which I agree with you is a critical component of essentially right-sizing the applications into a cloud environment, what other essential tools are missing from a lot of the implementations you've seen?

So a couple of things I can call out: more standardized libraries and standardized images. For example, you don't really want to go into using proprietary operating systems, because that just makes putting, let's say, monitoring mechanisms on the image a lot more difficult. If you're using things like DPDK, there's sometimes a tendency to roll your own TCP/IP stack over DPDK. I know there are some open-source frameworks that do a TCP/IP stack over DPDK, but I'm not aware of a standardized one right now. I think if we had a standardized library that does that, which VNF vendors can use and collaborate on, that would be awesome. Great, thank you. Sure.

I think Hank had mentioned VoIP needing things like SR-IOV in order for it to perform well, but then in the next slide, it was saying that SR-IOV wasn't compatible in a lot of cases with HA. So I was kind of curious: if it's needed, in one sense, to have the good voice quality, how are you getting the five nines with your platform if it's not able to do things like live migration these days? Go ahead.

OK. So one of the things is that this is still in the lab at the moment, the make-before-break mechanism that Hank described. One of the things also is that if you can't live-migrate, you have to make a choice about recreating the VM. Yes, you lose the sessions that are currently in progress, but you won't put the VNF out of service.
So it's a question of balance between how much reliability you want and the availability of the VNF.

OK, so you're just spinning up another VM; you're not live-migrating it then? In this instance, that's correct. The idea is that the packet loss is at the 100-millisecond level, so you barely even notice, to the human ear, that anything was lost, and the data plane is still up. This is, as I said in that section, currently done in the labs. They ran the test between Chicago, LA, and Secaucus and were able to see those kinds of performance numbers.

The other question I'd add, as far as what you're looking for from VNF vendors with the monitoring: is it simply to have SNMP capabilities, like MIBs and such, in order for you to keep track of when these VNFs are running out of steam, when they may need to be replicated? So DCAE is monitoring a variety of elements of your region or your data stack. It could be hosts, it could be the network, and it could notice other elements in that data flow that are degrading, and so it would do the cutover, the make-before-break scenario, that kind of thing. But how are they literally interacting? Is it simply SNMP queries? SNMP queries and notifications, yes. I mean, SNMP works most of the time; that's typically a standard way to run monitoring. I think, I mean, I don't think the monitoring is as much of an issue, provided you can define the KPIs that are most important to the VNF. What's more important, on the resiliency and performance testing side, is that we can get down to those scenarios that actually cause the degradation and then see how the monitoring reacts to that degradation.
Part of the design and creation phase, as you saw at the beginning, is that you would run testing scenarios before you deploy into your production environment, so the KPIs you're monitoring and the tests for your expected fault scenarios would have been put in place and exercised prior to production. The ONAP site describes the process for design and creation, and part of that is a test phase before it goes into production, where you can see whether or not you're monitoring the things you want to monitor. And as I mentioned, DCAE is monitoring a variety of things: Ceilometer, Netcool, Nagios, SNMP, syslogs. So there are a variety of angles from which the VNF is being looked at, and as you design your VNF, you can pick the parameters you want to look at. This is all just beginning, so we're all moving forward and trying to figure out the right dimensions to look at. All right, thanks.

Is ONAP itself running on the same OpenStack, or outside of that OpenStack? Outside that OpenStack. I mean, it may run on OpenStack, but it's not in the same regions your VNFs are running in. So, going back to the resiliency of ONAP itself: how do you do that? Put it in a region, load balance it, and all the standard techniques? Yes, yes: at the control-plane level, the same things you're doing at the data-plane level. OK.

If there are no other questions, I guess this session is complete. Thank you very much for attending. Much appreciated. Thank you.