Welcome, everybody. My name is Clayton O'Neill. This is Sean Lin. We're both principal engineers at Time Warner Cable, and we're here today to tell you about some network architecture changes we've made recently that we think are maybe a little unusual, and that we hope you'll find interesting.

To give you a little back history: at the beginning of the year we were running into some problems. At that time we were using Neutron with Open vSwitch and VXLAN tenant networks, on Kilo, and we had all of our virtual routers hosted on one of three control nodes. We didn't have HA routers in place, partly because we weren't sure how mature HA routers were in Kilo, but also because we were using the L2 population driver, which wasn't supported with HA routers in Kilo.

At the beginning of the year, Sean and I were on call in back-to-back weeks, and we had some network reliability problems. There were a number of issues going on at the same time. One, we had a customer that was being DDoSed on a pretty regular basis. We also had an environment that was close to running out of capacity, plus some NIC misconfigurations that were really reducing the capacity of that environment; to fix those, we were going to have to reboot all of our control and compute nodes, which we weren't real happy about doing to our customers. And lastly, we had a network upgrade that was ongoing and behind schedule. We knew it was going to fix a lot of these problems, but we weren't quite there yet.

The real issue we ran into, though, is that when we had these capacity problems, our control nodes would get overloaded. Sometimes they would crash; sometimes they'd just become non-functional. That highlighted a big problem: when one of these nodes went down, we'd lose networking for a third of our customers, because that's where their virtual routers were hosted. Everybody has nodes fail; it's something everybody has to deal with. But the failure mode in this particular circumstance was really unacceptable, and we needed to figure out our options for working around it, or other things we could do to alleviate the issue.

So we started working on all of the problems, because we knew we had more than one. But specifically, one of the things we looked at was whether there were other network architectures or Neutron deployment options that would make more sense and help us alleviate this issue. Moving to dedicated network nodes was the option we were most interested in evaluating against the problems we were having.

If you go look at the OpenStack reference architecture, you'll see a diagram that looks a lot like this one. What it shows is dedicated network nodes. For us, making that transition would have meant moving all of our virtual routers onto dedicated hardware. One concern we had with this is that we weren't sure how much of the load on our control nodes came from the virtual routers hosted there versus the other API services running on the same machines. So Sean did some testing, and he's going to talk about that in a little bit.
But the short version is that we weren't seeing much impact from either a CPU or a RAM standpoint in our testing, even when we were pushing gigabits of traffic through these boxes. That raised the question: what would the utilization on dedicated network nodes actually look like? Even if we set up three of them, how much would we really be using them? And the problem we saw with three network nodes was that we'd still have the same failure group size problem we already had: if one box died, we'd lose networking for a third of our customers. So going to more than three seemed like a requirement to get any improvement in that respect, but even at five or ten, the concern was that we'd be wasting valuable resources that could serve customers in other ways.

Dedicated network nodes still seemed like the leading option, but mostly because we were having a hard time coming up with other ideas. So we got together to brainstorm: what are our options, what are the pros and cons of each, and what would a detailed implementation actually look like? One thing we started wondering was whether we could co-locate the routers with some other service instead of the control node services. That would let us spread the routers around, reduce our failure group size, and put less load on each individual node.

And so the dumb idea was: what if we put this on compute nodes? That has the big advantage that we have a lot of compute nodes, which lets us spread things around. It also seemed like a good candidate because it doesn't make compute nodes any more important than they already are. Whenever we have a compute node failure, we always have customer impact, and we always have to treat it in a very serious manner. So that seemed like something that could work.

We started wondering, though: why haven't we heard of anybody else doing this? There must be something wrong with the idea. So we talked to other team members, to some other operators, and to a couple of Neutron developers. We got some varied feedback and things to think about, but the reaction was mostly the same as ours: this seems like it should probably work, but we've never heard of anybody trying it. So we started on a more detailed analysis: what would the packet flows look like, and things along those lines. Sean's going to go through what we came up with as far as the pros and cons of this approach.

Yeah, thanks, Clayton. Before we get into the gory details of the different packet flows and all the options we considered, let's be really clear: the particular server and networking hardware we have at Time Warner alleviates a lot of design constraints. We have a lot of bandwidth, both coming out of our servers and on our physical network, and we have very, very beefy servers with lots of RAM and lots of CPU. That shapes our design decisions as well. And as Clayton mentioned, we use VXLAN and OVS with Neutron.

Let's get back to some of the testing that we did. We wanted to know exactly what a virtual router costs in terms of CPU, RAM, and other server resources.
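For illustration, a test like that could be driven with a minimal harness along these lines: create a batch of routers, then sample host resource usage. The credentials, counts, and sleep are placeholders, it assumes the routers get scheduled onto the host being sampled, and this is a sketch, not the actual harness used in the talk.

```python
# Sketch of a router-overhead test: create N routers, then sample CPU, RAM,
# and load average on the host. Placeholder credentials; not the real harness.
import os
import time

import psutil
from neutronclient.v2_0 import client as neutron_client

neutron = neutron_client.Client(
    username='admin', password='secret',          # placeholder credentials
    tenant_name='admin', auth_url='http://controller:5000/v2.0')

def sample(label):
    """Print a one-line snapshot of host resource usage."""
    cpu = psutil.cpu_percent(interval=5)           # average over 5 seconds
    ram = psutil.virtual_memory().used / 2 ** 20   # MiB in use
    load1, _, _ = os.getloadavg()
    print('%-12s cpu=%5.1f%% ram=%8.0fMiB load1=%.2f' % (label, cpu, ram, load1))

sample('baseline')
routers = [neutron.create_router({'router': {'name': 'perf-test-%d' % i}})
           for i in range(50)]                     # the 50-router scenario
time.sleep(60)                                     # let the L3 agent converge
sample('50 routers')

for r in routers:                                  # clean up afterwards
    neutron.delete_router(r['router']['id'])
```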
And so we tested both a single-router scenario and a 50-router scenario. In both cases, what we found is that virtual routers have essentially negligible impact on CPU, RAM, and overall server load from a Linux perspective. What you would expect, though, is that they can consume every bit of bandwidth going in and out of the NIC. I was a little surprised at the outcome of the testing where OVS was concerned; I expected OVS to take a lot more resource than it actually does in this virtual router scenario.

So as Clayton mentioned, we had this crazy idea, but we did consider a lot of different options, and fundamentally they come down to four. One is traditional network nodes, and you can go two ways with this: really beefy servers with tons of networking and very few of them, or horizontally scaling out with cheaper servers to spread out your failure domain. The next option is DVR. We were a little intrigued by it, but it still requires network nodes at this point, so we'd still have to come back to a solution for those. And then there's our idea, which we're terming VRD just to make up a new acronym: traditional virtual routers, legacy or HA, distributed amongst the compute nodes. Let's be really clear: this isn't anything groundbreaking. It's a novel implementation of a reference architecture.

There are also solutions away from mainline Neutron, like Contrail, PlumGrid, et cetera. But we'd like to stay with mainline Neutron unless it stops providing our customers with what they need, or stops providing us with scalability and reliability. So within the realm of mainline Neutron, we really have network nodes, DVR, and this VRD concept. Let's discuss these in more detail.

Clayton had a similar slide up earlier, but this is essentially network nodes, however you implement them: your L3 agent and your metadata agent live on dedicated servers. Why would you pick network nodes? Well, first of all, it's the reference architecture, so there's a lot of documentation on it. It's easier to support, easy to scale, and easy to reason about. And the biggest reason we could come up with: if you have a lot of east-west traffic, VM to VM, but not a lot of north-south traffic out to the internet, that's probably the ideal case for network nodes.

But let's talk a little about the different options with network nodes. As I mentioned, you can have gigantic servers with lots of bandwidth coming out of them. But as we found out, those servers are generally idle except for bandwidth usage. In addition, you have pretty large failure domains: if you're running hundreds and hundreds of routers, the impact of one of those servers failing is pretty high. And there's also a cost to rebuilding them; once a server boots back up, there's a long time and a lot of overhead to rebuild all those routers and reconstruct the flows in OVS. This is getting better release after release, but it's still a cost. We also considered getting cheaper nodes and scaling them horizontally, but again, that's largely a waste of resources, and there was some operational overhead for us to carve off a new type of server and make it function in our automation.
The next option we considered is DVR. You still require network nodes at that point, but they have a lot less responsibility: they're only used when you're not using a floating IP and you have to get out to the internet via the external gateway. DVR doesn't yet have HA built in for that piece, so for part of our traffic flows we'd be back to our original problem; it doesn't really solve it in that respect. And the funny thing about DVR is that, functionally, you still have L3 agents placed on the compute nodes, which is exactly what we were proposing, but the mechanism underneath, the way the flows are put together, is fundamentally different. At Time Warner, we were also concerned about DVR's readiness for production and its ability to scale to what we needed, and it would have required a ton of operational tooling changes for us and a large retrofit under the hood, so it wasn't a great solution for us.

Still, there are a lot of commonalities, so I'd like to quickly walk through some stylized packet flows. There are four in this case. One is east-west traffic between VMs, illustrated in orange; this is actually the same path it takes with traditional legacy routers and in our implementation. The second, illustrated in purple, is a north-south flow with floating IPs; purple is the chief difference between the legacy and HA routers we're using and DVR. And there are some specialty cases, illustrated in blue, where virtual routers and VMs are co-located (which is what we're proposing and what we're doing), and where VMs need to talk to VMs within the same hypervisor host. I just wanted to illustrate those.

And now back to our dumb idea. Again, there are so many commonalities at a high level between what we're doing and DVR that we just made up an acronym. The biggest thing here is that the L3 agent and nova-compute are on the same physical nodes. You have that with DVR as well, but here your virtual routers are co-located in the same space as VMs. Some of you may be wondering about the potential for VM networking to conflict with virtual router networking; keep that in mind, and Clayton will come back to it in a little bit.

If you can remember the previous packet flows, there are really four flows here too, and I'm showing three because two are fundamentally the same. The orange flow is VM to VM, and that's exactly the same between DVR and legacy or HA routers. North-south with a floating IP is the biggest difference: with legacy and HA routers, all the packets have to go to a virtual router, get NATed there, and go out to the internet, whereas with DVR the NAT is handled straight on the compute node with no virtual router in between. The commonality in north-south traffic between DVR and this VRD solution is where an external gateway is required: both DVR and legacy routers need the virtual router when you have an external gateway. And again, we have some specialty cases that I highlight simply because we had test scenarios where we wanted to measure VM-to-VM bandwidth on the same hypervisor, and the impact of having a virtual router and a VM with high traffic flows between them on the same physical hardware.
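As an illustration of that same-hypervisor scenario, a bandwidth check could be scripted roughly as below. The VM addresses and SSH access are hypothetical, iperf3 must already be running in the test VMs, and this is a sketch, not the team's actual test rig.

```python
# Minimal sketch of the same-hypervisor bandwidth scenario: drive iperf3
# between two test VMs (one reached through a virtual router co-located on
# the same host) and record throughput. IPs and SSH setup are hypothetical.
import json
import subprocess

SERVER_VM = '10.0.0.11'   # placeholder: VM running "iperf3 -s"
CLIENT_VM = '10.0.0.12'   # placeholder: VM on the same hypervisor

def run_iperf(client, server, seconds=30):
    """Run iperf3 from client to server over SSH and return Gbit/s."""
    out = subprocess.check_output(
        ['ssh', client, 'iperf3', '-c', server, '-t', str(seconds), '-J'])
    result = json.loads(out)
    return result['end']['sum_received']['bits_per_second'] / 1e9

print('VM->VM via router: %.2f Gbit/s' % run_iperf(CLIENT_VM, SERVER_VM))
```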
So when we started working on this presentation, we had a placeholder in here for implementation and automation, and this is about as far as it got, because to be honest with you, the implementation is pretty straightforward. For us, for the most part, this was telling Puppet that our compute nodes should have L3 agents on them. Of course, we realized afterwards that we also needed the metadata agent on those boxes, because we couldn't boot any instances without it; depending on how you handle metadata, that may be a problem you'll run into. But past that, the actual implementation is really pretty straightforward. We thought it was going to be more involved, to be honest.

That being said, that's not to say we haven't run into problems along the way. There have been a couple of issues, and if you think this is an approach you might want to pursue, there are a couple of things you should be aware of, so I want to talk through them real quick.

Launchpad bug 1498844: this is the biggest problem that we've run into. The issue is that the L3 agent talks to neutron-server to provision routers and things along those lines, and all of those queries are handled in a single thread in neutron-server. This is fixed in Mitaka. The backport status shown as stalled is actually no longer accurate as of last week; I should have updated that. What the fix does is allow queries from the L3 agent and other plugins to be handled on all of the threads neutron-server is running, so you go from a single thread to tens of threads, depending on how you've configured neutron-server. We anticipate that this is going to largely alleviate the problem, but we don't have the fix in our environment yet. One thing to note: this bug was originally reported against DVR, but it's really a generic "running lots of L3 agents" problem, so if we had pursued DVR, we would have run into the same issue.

The way this problem shows up is in our monitoring of queue sizes in RabbitMQ. What happens is that the queue the L3 agents use to send messages to neutron-server starts to grow. Sometimes it would recover, but a lot of times it wouldn't; it would just continue to grow, and we would have to manually intervene. What's happening is that the L3 agents are sending requests to neutron-server, and that one thread on the server is falling behind; it's trying really hard, but it's not catching up. We saw this problem with just over 100 L3 agents sitting there mostly idle. There was a little bit going on in the environment, but this was mostly the L3 agents checking in with neutron-server to say, hey, if anybody does an agent list, let them know I'm still here.

The workaround for us was to not deploy L3 agents on all compute nodes. We picked 20 nodes to put L3 agents on, and we felt like that was a good trade-off between not running into this problem and still getting the failure group size we were looking for. So now, if we do lose one of those compute nodes that has routers on it, we lose roughly 5% of our routers, and that's a lot more manageable than a third.
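To make that symptom concrete, here's a minimal sketch of a queue-depth check against the RabbitMQ management API. The host, credentials, and threshold are placeholders, and the queue name q-l3-plugin is an assumption based on Neutron's default RPC topic for the L3 plugin in that era; verify the name in your own deployment.

```python
# Sketch of the queue-depth check described above, via the RabbitMQ
# management API. %2F is the URL-encoded default vhost "/".
import requests

RABBIT_API = 'http://rabbit-host:15672/api/queues/%2F/q-l3-plugin'
ALERT_THRESHOLD = 100      # messages; tune to your environment

resp = requests.get(RABBIT_API, auth=('monitoring', 'secret'))
resp.raise_for_status()
depth = resp.json()['messages']

if depth > ALERT_THRESHOLD:
    # In practice this would feed an alerting system. A growing queue that
    # never drains here was the signal that neutron-server's single RPC
    # thread for L3 agents (bug 1498844) had fallen behind.
    print('WARNING: q-l3-plugin depth is %d' % depth)
```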
The next issue we ran into came the first time a deploy changed the L3 agent configuration, which restarted the L3 agent. Whenever we do our deploys to compute nodes, we do 40 nodes at a time, and that meant all of these agents restarted in a pretty short period of time. When L3 agents start up, they go to neutron-server and say, hi, can I please have a list of all the routers I should be responsible for? And neutron-server gets inundated by all of these agents coming up and asking for that. It works really hard trying to get all that data out of the database, but what we found is that with our number of L3 agents (not really that many, to be honest) it frequently couldn't respond within the time period the L3 agent found acceptable. So all these L3 agents would come in, request their state, and by the time neutron-server could answer, they'd have given up and said, hey, look, I'll check back later. In our environment, the L3 agents never recovered from this situation on their own; it required manual intervention. When we did run into it, we would shut down the L3 agents on all of the affected nodes and bring them back up a couple at a time. That was workable, but clearly not a long-term solution.

So as a workaround for the time being: as I mentioned, we normally do 40 compute hosts at a time, and we've moved the hosts that have routers on them into a separate deployment group where we only do two at a time. With that approach, we haven't run into any issues with L3 agent restarts. It's not ideal, but it's a pretty workable solution for now, and hopefully once we have the fix for this bug in our environment, it's a workaround we'll be able to take back out.

As I mentioned before, the easiest part was actually figuring out how to put routers on compute nodes. The bigger problems have been around operational complexity. These aren't insurmountable, but it is a little more work. The biggest thing is that it's one more thing to check whenever a node fails, now that we have routers on these machines. That leads to the next point: we also had to make tooling changes. When we have a compute node fail, we have tooling that assists us in notifying the customers that would be affected, and that tooling had to be updated so we could also notify customers whose routers were on that compute node. Not a big deal, but something that had to be done. Another example is our tooling for evacuating a compute host of all customer workloads before any maintenance on it; that had to be updated to move the routers off as well (there's a rough sketch of that idea below).

One thing that makes this a little more complex is that we didn't just say compute nodes one through 20 are the ones that host routers. That would have been nice; it would have made it a lot easier to figure out what's going on when you go to troubleshoot. But because we already had existing compute nodes in place, it ended up being sparser than that, because we took rack topology and network topology into account when we placed those routers, to make sure we wouldn't lose a rack and lose all the routers at the same time. As Sean mentioned before, there was also some monitoring that had to be updated. Most of that was pretty minor: making sure that checks run in the right place, and things along those lines.
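The evacuation tooling mentioned above could look roughly like this, draining routers off one L3 agent onto the others with standard python-neutronclient agent-scheduler calls. Credentials and the hostname are placeholders, and this simple version glosses over HA-router specifics.

```python
# Rough sketch of router-evacuation tooling: move all routers off one
# L3 agent onto the remaining live agents before host maintenance.
import itertools

from neutronclient.v2_0 import client as neutron_client

neutron = neutron_client.Client(
    username='admin', password='secret',          # placeholder credentials
    tenant_name='admin', auth_url='http://controller:5000/v2.0')

def evacuate_l3_agent(host):
    agents = neutron.list_agents(agent_type='L3 agent')['agents']
    source = next(a for a in agents if a['host'] == host)
    targets = itertools.cycle(
        a for a in agents if a['host'] != host and a['alive'])

    routers = neutron.list_routers_on_l3_agent(source['id'])['routers']
    for router, target in zip(routers, targets):
        # Remove the router from the agent being drained, then re-add it
        # to another live agent so connectivity comes back elsewhere.
        neutron.remove_router_from_l3_agent(source['id'], router['id'])
        neutron.add_router_to_l3_agent(target['id'],
                                       {'router_id': router['id']})
        print('moved %s -> %s' % (router['name'], target['host']))

evacuate_l3_agent('compute-042.example.com')      # hypothetical hostname
```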
One of the things we've not addressed yet, though, is the capacity management aspect of this. We do have monitoring in place around capacity, but this change makes the situation more complex, because we're now mixing the north-south traffic from the virtual routers with the east-west tenant traffic, so it's harder to figure out why a given box is running into capacity problems. Thankfully, we recently upgraded our network environment and have a lot of capacity, so this isn't a pressing problem, but it is something we're aware of and will need to address in the future.

So, to wrap up, I want to talk about our plans for the near future and further out. The number one item right now is that we want to move to HA routers. We upgraded Neutron to Liberty a couple of weeks ago, and HA routers were on the short list of things we were looking for in Liberty. We've been doing some testing, and we expect that when we get back from the summit, we'll start working on rolling them out. As part of that, we've also been looking at custom router scheduling: we want rack- and network-topology-aware HA router placement, and we're writing a custom plugin that can key off information from Nova about host aggregates and things along those lines. We've also talked about taking resource utilization on those boxes into account. For example, if a customer had an instance with a flavor that had no network limits, or very high limits, we might not put routers on that node, or vice versa.

Another thing we've been thinking about: initially our plan was to put routers on all compute nodes, and we backed off of that. As I mentioned, that's led to more operational complexity, but once the bug is fixed and the fix is deployed, the question is whether we go back to putting them on all compute nodes. There's not really a clear answer at this point. What we have now works relatively well, and we've already made the tooling changes to work around these problems, so that's a decision we'll have to make in the future.

Lastly, do we use DVR at some point? There's a lot of good to be said about the DVR approach. I think the biggest problems for us right now are maturity and HA router support. As Sean mentioned, with DVR you still need an L3 agent hosting some of that traffic, and DVR doesn't support HA routers for that portion today. A lot of work happened in the Mitaka cycle, and I think it's scheduled to be mostly finished in Newton, so we're keeping an eye on it. One thing we're a little afraid of, and I think Sean alluded to this earlier, is that DVR is a big change in the operational model, so a lot of the troubleshooting training and things along those lines that we already have would need to be reworked for DVR. But that seems like where we'd probably like to be long term.
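For a sense of what that topology-aware placement could look like, here's a sketch that picks L3 agents in distinct racks for a router. It's written as an external helper rather than the in-tree scheduler plugin the talk mentions, since the plugin interface is release-specific; the rack map and hostnames are hypothetical, and whether manual scheduling applies to HA routers depends on your release.

```python
# Sketch of rack-aware router placement: pin a router to L3 agents so that
# no two copies share a rack. The rack map would come from Nova host
# aggregates in practice; here it is a hypothetical static dict.
from neutronclient.v2_0 import client as neutron_client

neutron = neutron_client.Client(
    username='admin', password='secret',          # placeholder credentials
    tenant_name='admin', auth_url='http://controller:5000/v2.0')

RACK_BY_HOST = {                                   # hypothetical topology
    'compute-001': 'rack-a',
    'compute-014': 'rack-b',
    'compute-027': 'rack-c',
}

def place_router(router_id, replicas=2):
    """Assign the router to agents in distinct racks."""
    agents = [a for a in neutron.list_agents(agent_type='L3 agent')['agents']
              if a['alive']]
    used_racks = set()
    for agent in agents:
        rack = RACK_BY_HOST.get(agent['host'])
        if rack is None or rack in used_racks:
            continue                               # skip unknown/repeated racks
        neutron.add_router_to_l3_agent(agent['id'], {'router_id': router_id})
        used_racks.add(rack)
        if len(used_racks) == replicas:
            break
```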
So if you want to get in touch with Sean or me, any of these mechanisms is a good way, but I think we've got some time for questions if anybody has any.

And if you do have a question, you should come to the mic, apparently. The race is on.

What kind of latency do you see with these virtual routers? Latency? Like latency added to VM-to-VM communication, or egress from VMs, or ingress to VMs going through a virtual router, as it relates to the change that you made. It's the same: the latency we saw in both architectures is fairly minimal. I mean, there is some overhead because you may have some extra hops involved, but it's all over 10 gigabit Ethernet, so it's not really significant. Is it like a microseconds difference, or? Low milliseconds to microseconds, yeah.

Yes. Yeah, I can't help but notice that you mentioned you knew about solutions like Contrail or Nuage from ALU that solved these problems you're describing, like, three years ago, and you said you wanted to stay on mainline Neutron as long as possible. At what point, though, when you're solving problems that have already been solved, does the value of your time start making you lean towards one of those solutions instead, so that you don't have to roll all this yourself? We ask a similar question every day. Okay. No, we've always got these things under review: really, how bad is the problem we're experiencing, and how big is the retrofit cycle? If it's super painful, that's almost a non-starter for us. Yeah, and at this point we're not doing a greenfield deployment, so the migration cost for moving to anything along those lines is going to be significant, and that's definitely part of the discussion. Understood, okay, thanks.

So from a security perspective, did you have any considerations to put in place for exposing the compute nodes to the outside? I missed the last part of the question. The compute nodes had to be exposed to the internet, right? So there were no network topology changes that needed to be made; all the network topology was already set up so that this could happen, because we already needed connectivity from these nodes to the control nodes. This was purely a software change from our standpoint, so I don't think there were any additional security concerns. We had already gone through that cycle of analysis, because we support external networks for our customers, so we were already able to drop traffic onto an external VLAN from the compute nodes anyway. Yeah, maybe the big concern I have about DVR is that it exposes the compute nodes to the internet, so I wanted to make sure something like that is taken into consideration in the design of DVR, because if you have the network node outside, then you only have to care about protecting the network node. Right. But if you have the compute nodes exposed outside, now you have a bigger footprint to protect. Right, and we have that concern as well. We own the network out to the internet as well, so we do see some DoS attacks. Fundamentally, we haven't seen too many problems from dropping VMs directly on that VLAN that's exposed to the internet, but we are looking into it, and we do have some bigger mitigations in place outside of OpenStack, on the edge. Thank you.

So does your approach mean it requires a public IP address for compute nodes? You require an IP address per router, but we already had that requirement.
We already had public IPs for management functions on those boxes, and this didn't change our IP address usage, because we were basically just moving the routers; no additional IP addresses were required. That is an issue some people raise with DVR as it stands today: it does require more IP addresses. Thank you. Any other questions? Well, thank you, everybody. Thank you.