Well, welcome. Guess I'm on here. Wow, I actually didn't expect such a full room for the last session, so thanks for sticking around. My name's Jack McCann. I'm the tech lead for Neutron in HP's public cloud, and I'm going to tell you a bit of a tale of how we got to where we're at today with Neutron. Before I get started, I just wanted to ask: how many people have deployed OpenStack Neutron? Wow, OK, wow. How about Nova Network before that? OK, I don't know why I'm here, you're all experts, right? So I want to start with a little background on HP public cloud and what we've been doing over the last few years. We started about three years ago, when we made a strategic decision to build our public cloud deployment on OpenStack. At that time we didn't have many options, so we went with Nova Network for our first-gen compute service. This was in the Diablo time frame, and around, maybe it was August, the multi-host feature showed up in Nova. And we thought, that's what we want, so we adopted the multi-host feature. There were a couple of things with multi-host that we wanted to do a little differently, so we wound up with a few extensions to that model: we ran the same gateway IP address across all our compute servers, the same gateway MAC, and a couple of other modifications. So that's what I mean by HP extensions. And that model worked very well for us. We got that deployed in 2011. In early 2012, we took a look at what we were going to do going forward, and there were some features that we didn't get with Nova. We had a flat shared network for our tenants, and we really wanted to get to per-tenant private networks, bring-your-own-IP-addressing type of things. So we started looking at Quantum at that time. Over the last two years, we've been working with Quantum, standing it up in our test environments. And about a year ago, we went into public beta with Quantum, Neutron, switched over to Neutron. Forgive me if I say Quantum. So we got into public beta with Neutron about a year ago and really started to hammer it with load. And we ran into a lot of issues. It was a bumpy road, the latter half of 2013. What I want to talk about today is the bumps that we hit in the road, what we did about them, and what we've contributed back to the community to try to make Neutron better. So, a little background on the model we run in our public cloud. This is our tenant-facing network model: per-tenant routers with private networks. Some folks have asked along the way, why this model? Basically, you can deploy Neutron in multiple ways, and this was really the only model that gave us all the features that we needed: the isolated private networks, overlapping IPs. And then we had inherited other functional requirements from our Nova stand-up around security groups, Nova metadata, floating IPs, and DHCP. So this was our tenant-facing deployment model. If you come onto public cloud as a tenant, you'll get your own router, your own networks, your own subnets. Now I'll get a little bit into how we actually deploy the service. We use a standard server building block. We've got a few flavors of servers that we've run over time, but they're HP servers. The basic server building block is x86, dual-socket, eight-core, hyperthreaded, so 32 CPUs presented to the operating system. Most of them have 128 gig of RAM, so we generally don't have any memory issues.
Local storage varies; two terabytes is pretty standard for the non-storage-oriented servers. We run a pair of 10-gigabit NICs in a bond, and what we do with these NICs is, for each server in the rack, one NIC runs up to one top-of-rack switch and the other NIC runs up to another. We use HP networking gear in our network fabric, 5900-series switches, and one of the nice features in those switches is that we can run them with this Intelligent Resilient Framework. Basically, it lets us cluster the switches so they present as one switch down to the server. The server sees one switch at the other end of that link aggregate, so we can lose a ToR and still have connectivity to our servers. So that's the basic building block. Now, what do we do with it? The heart of the deployment is the Neutron server. In our deployment — can you guys see this OK? The font's a little small — the heart of the deployment is the API server. For the Neutron server, we use a pair of those server building blocks behind a load balancer, and we have a dedicated API access network for our tenants to come in on to the Neutron endpoint. We take another pair of those servers, plus a quorum node, clustered to run our database; we run MySQL. Another pair of servers for RabbitMQ, again with a quorum node, clustered for high availability. If you missed it, there was a talk on Tuesday by a couple of my colleagues that went into some detail on how we run our HA clusters, so you could go back and take in that talk if you managed to miss it. Compute servers: many, many compute servers. I think our largest deployments now are about 1,000 compute servers in a single Neutron control plane. And finally, because we didn't really have any other choice, the dreaded network node. We run several servers to host the network service functions: the DHCP agent, the L3 agent, metadata. We do run these in a clustered configuration for high availability; we've got some failover scripts, so if one of the nodes fails, those networks and routers get failed over to the remaining network nodes. Those servers are connected out to the internet for the external network, for floating IPs and the routers' static SNAT addresses, pretty standard stuff. Underneath these servers, we've got a management network where the RPC, database, and other management traffic runs. And finally, we do use an overlay network — I'll say a little bit about that — and we break that traffic out onto its own VLAN. So this is essentially our basic deployment model using those building-block servers. Now, on this particular slide you'll notice there are a couple of colors: there's some green and there's some blue. In the interest of full disclosure, we are running our own plugin, and I'm going to ask you to kind of set that aside for a little bit, maybe suspend disbelief for a minute. The green here represents code that we run straight from upstream Neutron; basically, 85% of the code we run is straight from upstream. If you've looked at the open-source plugins, Open vSwitch, Linux bridge, you know that a plugin is usually a pretty thin wrapper around a chunk of common Neutron code. Our plugin is similar: it's a thin wrapper around some common stuff. So the blue is meant to represent the HP-specific code, and it's a small layer. We also have some HP stuff in the L2 agent, and you could think of that stuff as analogous to, say, the OVS or Linux bridge plugins with VXLAN, L2 population, and an ARP responder function.
That's basically the functionality that's living down there. Why aren't we using OVS or Linux bridge with all that stuff? Basically because it didn't exist two years ago when we started down this path. I would use it today, though, if I were starting again, and we are actually moving in that direction with our new HP Helion OpenStack release. But I'm not here to talk about the plugin; I'm here to talk about the other 85% of the code. And the ratio of problems we ran into was roughly in that proportion: 85% in the common stuff, 15% in the proprietary code. So, the Neutron server, the heart of your deployment. You've got to get this right; it's got to be provisioned right and tuned properly. If this is not running right, if it's clogged up, the rest of the system starts to behave in really interesting ways, starts to fail in interesting ways. I mentioned that we run a pair of these Neutron servers behind a load balancer. So we've got two of those very capable servers I described earlier behind a load balancer, handling API requests from users, RPC traffic from the agents on the compute nodes and the network nodes, updates to the database servers, calls out to Keystone, and now handling API requests from Nova as well. One of the very first things we ran into as we started to load up this system was the single-process neutron-server as a bottleneck. We've got 32 CPUs presented to the operating system; one of them's pegged and the others are twiddling their thumbs. What are we going to do about this? We kind of expected this problem because we'd run into it two years earlier when we first stood Nova up, and we applied the same answer that we did for Nova: we added support for multiple worker processes. These links here are hyperlinks to the upstream patches that we contributed back in Icehouse. So now in Icehouse, you can go into your neutron.conf file and tune the number of API worker processes and the number of RPC worker processes (I'll put a quick sketch of that up in a moment). When you do that, you should see on your server that the load spreads out, and it eliminates that single-process bottleneck. We had a rash of issues when we had only a single-process API server, and as soon as we deployed this in our test environments a year ago, that whole rash of problems went away. And a whole new rash of problems cropped up, because the problems moved. I will caveat: there is a caveat on RPC workers. It's currently tagged as experimental. It does not work well with Qpid and ZeroMQ, I think — it has issues there — but with RabbitMQ it works pretty well. And you really do not want to be bottlenecked on RPC processing in your Neutron server. So, Neutron server, heart of the system, pumping strong, going well. Now we're ready to fire up some VMs. Our tenant Sally here is going to help us through the presentation today. And the first thing Sally wants is an IP address. We'd like to make our tenants happy, but Sally doesn't look happy. We debugged all sorts of these throughout last year: I can't ping my VM. How many of you have debugged "I can't ping my VM"? OK, you feel my pain, I feel yours. Symptoms: the dreaded "route info failed" message in the VM console log, a pretty typical indication that the VM just didn't get its IP address from the DHCP server. We encountered various issues with the DHCP service over the course of last year, and this is the series of fixes that we had to make to the DHCP server to make it run better.
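To make that worker tuning concrete before we get into the DHCP fixes, here's roughly what it looks like in neutron.conf on Icehouse. Treat the numbers as illustrative rather than our production values; you'd size them to your cores and load:

```ini
# /etc/neutron/neutron.conf (Icehouse) -- worker counts here are illustrative
[DEFAULT]
# Separate API worker processes (0 keeps the old single-process behavior)
api_workers = 8
# RPC worker processes -- still tagged experimental; fine with RabbitMQ in our
# experience, less well behaved with Qpid or ZeroMQ
rpc_workers = 8
```

With that in place you should see the load spread across several neutron-server processes instead of one pegged CPU.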
Back to DHCP: these are the actual merges that we've pushed upstream. So, some of the problems that we ran into. The first one is just having a default route: if you're using the default DNS service from dnsmasq, it needs one so it can actually go talk out to the internet and be a proper caching server. And then several that are fixed in Icehouse. We ran into some interesting cases. The DHCP agent sits in a loop trying to synchronize its state with the server, and we found cases where it would try to configure a network and something would go wrong — there was probably some lower-level bug blocking that network configuration — but that failure would kick it out of the sync-state loop, and the agent would stop processing networks. So it was a case of one bad apple spoiling the bunch: it would come around again, try to sync up again, and hit the bad network again. This was causing us issues with new networks and updates, so we fixed that. We ran into a couple of issues with the DHCP agent cache getting out of sync, some interesting corner cases. There was a merge to keep dnsmasq in sync: dnsmasq thought it was still holding onto a lease while the DHCP agent thought the address was freed up, so we made some changes to help synchronize that. We had some cases where that sync-state loop was taking so long to complete that a new sync state would start up, and that caused some issues as well, so we addressed that. This was a good one: we noticed in the API server log files that there were thousands and thousands of calls from the DHCP agent to get the DHCP port every day, just filling up our log files. It turned out that, just the way the agent was coded, it was doing the get-DHCP-port call when it didn't really need to; it only needed to do that once and remember it. All of these combined fix up some corner cases and make things a little more efficient. DHCP agent getting a little healthier. This last one is one we found recently, when we want to move a network from one DHCP agent to another, either in a failover situation or when we're evacuating a node for maintenance. What we found was that when we remove the network from one DHCP agent, the old port gets freed, and when it comes onto the new agent, that new agent grabs a new port with a new IP address. So all your VMs that are sitting back there using that default dnsmasq as a DNS server on the old IP address suddenly have no DNS until they refresh their lease. So we've got a change — this one's in review; if there are any cores in the audience... nobody wants to raise their hand — we've got a change up in review, hopefully landing shortly in Juno, where if you do move a network from one agent to another, it'll keep the same port, or attempt to keep the same port, to address that problem. So, DHCP agent getting healthier. The VM's got an IP address; now it wants to get some metadata. Let's see if Sally's happy. Sally's not happy yet. Anybody debugged "I can't SSH"? Yeah, OK. Again, the dreaded "giving up on metadata" message in the VM console log is the typical symptom here. We would see this: cloud-init trying to connect out to the metadata server, and it would time out and give up, in this case after about three minutes. So we turned our attention to the metadata agent. For those of you that aren't familiar with how that flows: basically, when a VM wants to get its metadata from Nova, it connects to a metadata proxy process that runs in the router namespace.
That proxy process will in turn connect via Unix socket to the metadata agent, and that metadata agent will then go and talk to the Nova metadata server to get the VM's metadata. And you can be launching multiple VMs in parallel, like some of our QA team likes to do — 100 VMs, 300 VMs — and you can have other tenants launching their VMs and trying to get their metadata connecting through. Anybody see a bottleneck here? What we ran into was actually two things. The first was that there was a single-process metadata agent, a lot like the API server: one process, too much work. The other interesting thing we saw was that the default listen size on the agent's listen socket was too small, so we were seeing connections getting dropped. So we pushed a pair of changes — again, I think these landed in Icehouse — to be able to spawn multiple metadata workers and to configure that listen socket with a higher backlog. So in your metadata agent INI file there are now these two configurable parameters to address that problem (I'll put a quick sketch of them up in a second). Now, the other potential bottleneck here is your Nova metadata server, but I'm going to punt that one off to the Nova guys; you'll have to go to a Nova talk for that. It does need to be properly tuned to make this work. So we unclogged that bottleneck. It's got our IP, got our metadata. Is Sally happy? No, she's still not happy. So this graph here is from some of our internal monitoring software. The blue bar shows the time a VM takes to become active, the green shows the additional time it takes to become SSH-able, and the yellow indicates when a VM failed to become SSH-able within a specific period of time. You can see that things were running along fine, and all of a sudden we had a gap where the VMs didn't become SSH-able, and then they picked up again. We saw these intermittent reachability issues, and we eventually traced them back to a restart of our L3 agent. So let's take a look at the L3 agent. This was an interesting one as well — some interesting times with the L3 agent. First one: restarts. Every time we restarted the L3 agent, it would go and tear down all of our router namespaces, all the floating IPs, and plumb them all back up. Yellow lines on our graph all over the place. So that got fixed in Icehouse. After that, we were still seeing some outages on restart. You know, you restart because you want to upgrade the software, fix a bug, tune something. And we found a case where, during the restart, the agent is trying to make sure everything is plumbed up the way it's supposed to be, and when it got to the floating IPs, it would actually delete them and then add them back in. So there were temporary outages there, and we got past that. One of the other factors contributing to the size of that gap was that when we configure a floating IP onto a router, we're configured to send gratuitous ARPs out, and we've got ours set to three, so we'll send out three gratuitous ARPs. It's an arping with a count of three, and that takes three seconds. So if you're the L3 agent and you come along and want to add a floating IP and do an arping — OK, I'm going to stick around for three seconds while the arping runs, and then I'll get along to the next floating IP. It was slowing things down, particularly when the agent had a lot of work to do, a lot of floating IPs to add.
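Here's the metadata agent sketch I promised a moment ago: a minimal metadata_agent.ini fragment showing the two Icehouse options. The values are illustrative, not our production tuning, and if the option names differ in your release, check the sample config that ships with it:

```ini
# /etc/neutron/metadata_agent.ini (Icehouse) -- values are illustrative
[DEFAULT]
# Number of metadata proxy worker processes to spawn
metadata_workers = 4
# Backlog on the agent's UNIX-domain listen socket, so bursts of parallel
# VM boots don't get their connections dropped
metadata_backlog = 4096
```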
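And while we're in config files, the gratuitous ARP count I just mentioned is a knob on the L3 agent; under the hood the agent runs one arping per floating IP, something like the command below. The namespace, interface, and address are hypothetical:

```ini
# /etc/neutron/l3_agent.ini
[DEFAULT]
# Gratuitous ARPs sent when a floating IP is configured; at roughly one probe
# per second, each floating IP costs about this many seconds of arping
send_arp_for_ha = 3
```

```bash
# Roughly what the agent runs per floating IP (names and address made up)
ip netns exec qrouter-<router-uuid> arping -A -I qg-1a2b3c4d-ef -c 3 15.125.1.23
```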
A lot of floating IPs to add is exactly what you get in a failover or a migration. So we made a change to spawn that arping into a thread, and that landed in Icehouse and unclogged that bit of the bottleneck. And then this last one — this was a pretty good one too. These are some things that we've done during Icehouse; there's been a lot of work in the community to improve these things as well, I should mention that. One of the improvements was that when the L3 agent is restarting, it goes out and wants to know, OK, what are all my routers? It's got to resynchronize its state, and there was a change to do that in parallel, so you could process multiple routers in parallel. The thing we ran into with this was that as you're processing these routers in parallel, they're adding floating IPs into the namespaces, doing iptables commands to set up the NAT rules, and there's an iptables lock to serialize access around the iptables update, because it's not atomic. What we saw was that all these updates that were trying to happen in parallel got serialized behind the iptables lock. So one of our guys said, well, why don't we just use a separate iptables lock per namespace, because they are actually independent — you can update this namespace's iptables and that one's at the same time. So we've got a change upstream in Icehouse to fix that by using independent per-namespace iptables locks. And the last one here, and this took a while: during that restart, while the agent is trying to sync up its state, there's new work coming along — new routers, new floating IPs, maybe some going away — but it wasn't processing that new work until it finished synchronizing. So one of our guys, Carl — somewhere, I don't know if Carl's here — has a change in review (any cores? cores?) to fix this, so that while it's resynchronizing, if new work comes along, that gets some priority, because probably most of the existing stuff is already there and it's just checking that it's still there. So, a lot of improvements in the L3 agent in Icehouse, these and others from the community. Let's check back in with Sally. Is Sally happy? It's always the network. You notice the problems are getting harder, too? So Sally's not happy: the network's slow. Kernel version matters. We ran into several issues along the way, particularly as we started to scale up the load on those network nodes. We had a situation where if you tried to delete a namespace, the kernel would panic. And by the way, if you don't have namespace deletion turned on, you really do want to delete namespaces — you don't want those things just piling up; it's not a good thing to have them lying around, as we found (I'll show the config knobs for that in a minute). veth pairs: if you've got veth pairs in your configuration, we found those to be a throughput bottleneck in earlier versions of the kernel. I think we couldn't push more than three gigabits through one, and if we took it out, we could push the full wire rate of nine or ten gigabits. Issues with network namespaces: as the number of network namespaces on the servers increased, the overall system performance slowed down. And finally, another one we ran into was the conntrack table in the qrouter namespace: we found occasions where it would fill up with these unreplied entries. Now, the good news is that most of the problems we ran into had already been fixed in later kernels.
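Those namespace-cleanup knobs I mentioned: the agents won't delete namespaces unless you ask them to. As of Havana/Icehouse the options look like this; they defaulted to off precisely because of the kernel and tooling issues I'm about to go through:

```ini
# /etc/neutron/l3_agent.ini
[DEFAULT]
router_delete_namespaces = True

# /etc/neutron/dhcp_agent.ini
[DEFAULT]
dhcp_delete_namespaces = True
```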
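And if you want to check whether you're hitting that conntrack problem, you can peek inside a router namespace with something like this; the namespace name is hypothetical, and you need the conntrack-tools package installed:

```bash
# Count unreplied conntrack entries inside a qrouter namespace
ip netns exec qrouter-<router-uuid> conntrack -L 2>/dev/null | grep -c UNREPLIED
```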
Running through those kernel fixes: the namespace-delete panic was a null-pointer dereference; I forget the details of that one. The veth performance fix, I think, was around using per-CPU statistics. The network namespace performance one — you really want that one, by the way, if you've got a lot of namespaces. The problem was that each time an agent issued a command into a network namespace, a process entered the namespace and, on exit, came back out, and the kernel would hold a VFS mount lock as it was coming out. That was a global lock, so with a lot of these commands and a lot of namespaces, it took a while to clear out of the namespace and the whole system would slow down. And finally, the conntrack-table-full issue: what we found was that, by default, conntrack will install entries for in-flight connections with a default timeout of five days. One stray packet, five days filling up your conntrack table. This fix reduces that to a five-second timeout. That's a good one to have, too. I talked a little bit about that namespace issue. So, commands — commands can hurt too. This particular issue: as the number of interfaces on the system grew, along with a large number of network namespaces, sudo would slow down. And almost everything those agents do, every command they run, they run through sudo. On a smaller system you don't really notice it, but when you get, say, a thousand network interfaces and hundreds of network namespaces, the slowdown was considerable. What one of our guys found was that sudo, when it starts up, goes and enumerates all the network interfaces on the system. It doesn't even do anything with them. So there's actually an option in version 1.8.10 of sudo to disable that: in sudo.conf, you set probe_interfaces to false (I'll put the exact line up in a second). And you can see the difference it made here. Without the fix, just timing a sudo sleep of one second — which should take one second — was taking one and a third seconds, most of that being system time. Afterwards, not so much. That's a good one to have. So, check in with Sally. Is Sally happy? Anybody want to guess? Oh, Sally, Sally, Sally. For those of you that might have attended the sessions earlier this week, IPv6 is not quite there. We're hoping to make a lot of progress on that during Juno; the guys at Comcast are really driving that forward, and we've got some guys at HP helping out with that. Sally's going to have to wait a little longer on the IPv6 addressing, though. How many folks are trying to run IPv6 in their deployments? Well, obviously it doesn't work very well yet, but OK. If it worked well, how many people wouldn't want to be running it? OK, sort of like Neutron, right? Oh, did I really say that? OK. So Sally, hopefully, is about as happy as we can make her right now. I do want to say a little bit, looking forward to Juno: one of the gaps from Nova Network to Neutron that I'm sure a lot of people in this room are aware of is that there wasn't really an alternative or an analog to the multi-host model. Stepping back a little bit: when we started down the path to public cloud at HP, we looked around internally at HP at some different technology, and one of the pieces of technology that we found at HP Labs was something called Diverter, a distributed virtual router. So that's an interesting thing. By the way, that's a published paper — that's a hyperlink to the paper if anybody's interested. So that's something we might want to use in cloud.
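Before I fast-forward on that DVR story, here's that sudo fix spelled out. It needs sudo 1.8.10 or newer, and the quick sanity check is just timing a one-second sleep through sudo:

```bash
# /etc/sudo.conf (sudo >= 1.8.10): stop sudo probing every network interface
#   Set probe_interfaces false

# Before the fix this took roughly 1.3s on a box with ~1000 interfaces,
# mostly system time; afterwards it's back down around the expected 1s
time sudo sleep 1
```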
Fast forward a couple of years. I talked about our first-generation compute service on Nova Network; multi-host went a long way, a big step towards DVR, and we had some extensions to get it to use the same IP and MAC across all the compute servers. So in our v1 cloud, no matter what compute server your VM landed on, it always saw the same gateway IP and MAC. Fast forward a couple more years, and there have been a couple of starts on multi-host for Neutron over the last couple of years. What we did last fall was take a team of HP engineers and dedicate them to making that feature available in Neutron. There was some talk of that in the design summit session yesterday; it looks pretty well on track. That's the blueprint link right there. We really hope that's going to land in Juno. For those of you not familiar with DVR, a quick word about it. Essentially, each compute node provides the routing services for its local VMs. Inter-subnet traffic, east-west, instead of having to go up to the network node and back down, will flow directly between the relevant compute nodes. Floating IP traffic from the external network, instead of going through a network node, will go straight to the compute server. There will still probably be a network node out there for default SNAT, perhaps DHCP, some other services that can't be distributed, but this goes a long way towards getting us back to the sort of model we had with Nova Network multi-host, which had some really nice fault isolation and scaling properties: you lose a compute node, you lose those floating IPs, but you've lost those VMs anyway. So you can see here, in the virtual world you've got Router A with a couple of VMs, and in the physical world Router A follows VM1 down to the first compute node and follows VM2 off to the third compute node, and the same thing happens with Router B. So we're really looking forward to that in Juno, and I think it'll help out a lot. Oh, Sally's happy. Thank God. So, summary and conclusions. Upgrade Neutron: if you haven't tried it lately, Icehouse is better than Havana is better than Grizzly. The Neutron server is the heart of your deployment; make sure it's properly provisioned and tuned. If you're using metadata, you might want to take a look at tuning the metadata agent. Upgrade your kernel, grab that sudo fix — there are probably other commands too that I forgot to mention; I think there's an iproute2 version that you need to delete namespaces — and I think we'll see improved performance and scalability in Juno. So I'm going to stop there and take questions. I live for it. No questions? Hey, well, listen, thank you all for staying. Oh, is he leaving or has he got a question? I wonder, with all the problems you had with the metadata service and DHCP, why you didn't consider using config drive with cloud-init? So, good question: why use DHCP and metadata at all? A little bit of that's buried in the history. We stood up our first cloud offering with a metadata service, and we wanted that to be available to our second-gen VMs. There is the option to do config drive. File injection we're not so fond of; we didn't necessarily want to be mucking around with our tenants' VMs in our cloud. Yeah, we went from file injection to using config drive. Config drive's working well. cloud-init, yeah, it's been trouble-free so far. We may wind up having to go there for IPv6. Thank you. Thank you, good question.
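Just to make that config-drive option concrete, by the way: it's a single flag at boot time, and cloud-init reads the drive instead of hitting the metadata service. The flavor, image, and network names here are made up:

```bash
# Boot a VM with a config drive (flavor/image/net names are hypothetical)
nova boot sally-vm --flavor standard.xsmall --image ubuntu-14.04 \
    --nic net-id=<private-net-uuid> --config-drive true
```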
Anybody else? Hi — in HP's experience in the public cloud, how many VMs is it suitable for a single Neutron server to handle? So the question is how many VMs a single Neutron server can handle. Yes, exactly. That's a good question; I actually don't have the answer to that. I can tell you that we're currently running on the order of multiple thousands of VMs, multiple thousands of networks and routers, on that configuration. We're not at capacity: the database server's not breaking a sweat, RabbitMQ is twiddling its thumbs. There's plenty of capacity left. OK, thank you. Thank you. OK, thanks everybody. Thank you for staying. Thank you.