All right, everybody, thanks for coming. This is Dev Oops: Lessons Learned from a Cloud Network Architect. I'm James Denton, a principal architect for Rackspace. My name is Jonathan Almalley. I work with James; I am a network architect. Perfect.

So after the presentation, we're going to throw the slides up on SlideShare. If you want to grab the link, don't feel like you have to take notes or anything. Big pictures of slides. We'll have them all for you. Yeah. Great.

Rackspace, we are a hosting provider, right? One of the earliest contributors to, and really a co-creator of, the OpenStack project. As architects and engineers on the Rackspace private cloud team, we support a wide range of OpenStack clouds, from Folsom to Newton, so we've seen a lot. I've been on the team since 2013 and working with OpenStack since 2012. Johnny joined me in 2014. What we're here to talk about, as we've worked on these clouds and implemented various configurations over the years, are some of the lessons that we've learned and some observations that we'd like to share.

So as James mentioned, we've got hundreds of clouds that we support, all of them varying in size, complexity, scale, and purpose, and each one is unique. So when we work with our customers to design the cloud and to figure out the proper configuration for that cloud, be it an actual whiteboard session or just James and I talking, we're taking into consideration the community options that are available, the customer's scale, the performance requirements of that customer, and again, they're all going to be different. Most of the time, we're right on. Most of the time, everything goes as planned, but there are some occasions when things don't go as we intended, when we get hit in the mouth, so to speak. And it's up to James and me and our team to kind of roll with the punches and figure out how we're going to move forward and get past a particular issue without having to repave the entire network, without having to repave the entire cloud.

So what we're not here to do is start a flame war over the various technologies that are available to you with OpenStack. We don't want to advocate one method or another. What works for one type of cloud may not work for another. But knowing what your options are, and where the limits of particular options are, is very important. And you're going to see, in one case, like James mentioned, we're not here to discourage or to say don't use one method over the other. There's going to be a case where we start with A and we move a customer to B, and in another example, we'll start with B and we'll move the customer to A. So we do go back and forth. Right. It's a matter of what's good for a particular cloud. Yeah.

And as Johnny mentioned earlier, you start out with a set of requirements when you're building a cloud, and over time, those requirements change. As more users start adopting your cloud, we have found that their requirements may not fit the initial design. And so you either have to work around that and tack features onto your existing cloud, or maybe you have to repave, but that's not always an option for folks.

The first thing we want to talk about is the battle of the switches, right? Open vSwitch versus Linux bridge. Like I said, we're not here to start a flame war. Both are perfectly valid and production-ready switching technologies. Let me go back a little bit and talk about our journey to Neutron.
So as I mentioned, I've been involved since 2012, and back then Nova Network was what you had, and Nova Network utilized Linux bridges to provide connectivity to instances. When we adopted Quantum, and then Neutron, what you had there was this community-driven idea to use Open vSwitch rather than Linux bridges. One of the lessons that we learned was that community documentation really helps drive adoption. At the time, especially, all the documentation guided folks towards Open vSwitch, and that's kind of the direction that we went. One of the appealing features of Neutron at the time was tenant-managed networking. We got to leverage overlay networking technologies, which was primarily GRE at the time, later VXLAN. And we knew that the community was moving in the direction of Neutron. Nova Network was going to be deprecated eventually, but nobody knew exactly how long it was going to be. I think everybody was very optimistic at the time that it would be, you know, six months to a year, but as we found out, there are still folks using Nova Network, probably not doing new deployments, but definitely feeling the pain of having held on to that technology for a little bit longer than they needed to.

So some of the issues that we encountered in that initial phase of our private cloud were the immaturity of Open vSwitch at the time and the immaturity of Neutron itself, right? With Open vSwitch, version one especially, we initially witnessed a lot of packet loss and packet corruption, especially when there was a lot of traffic being pushed through a box. Enhancements to Open vSwitch included the use of megaflows versus microflows, or wildcard flows. We experienced some kernel panics. And as far as the agents are concerned, we ended up submitting a lot of bugs to fix issues with the agents, because maybe they didn't manipulate the flows they needed to. Or, in one case, if you were to restart Open vSwitch, the Neutron agent wouldn't program new flows in time and you ended up with bridging loops, which in a physical environment is bad enough, and in a virtual environment is compounded even more. At the bottom, we have a couple of bugs that we opened that have since been resolved. Open vSwitch is an awesome technology, two or three years more mature than it used to be, and the same can be said for Neutron and its Open vSwitch agents as well.

So at the time, a couple of things prohibited us from upgrading Open vSwitch, right? Kernel incompatibilities with new Open vSwitch versions, needing DKMS to compile kernel modules, and a reluctance to believe that the next version of Open vSwitch was going to be the one that solved all of our problems. At the time, you saw a lot of transition in the community going from the monolithic core plugins to, trying to think of the term here, anyway, to ML2, right? So at the time you had OVS or you had Linux bridge; with ML2 you had the possibility of both. And what ML2 provided us was the ability to easily leverage the same database schema and change the core plugin behind it, right? So for this customer in particular, running Open vSwitch and having a lot of problems, we decided to first upgrade from Grizzly to Havana so that we could leverage that new ML2 technology.
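To give you an idea, the ML2 side of that swap largely comes down to configuration rather than schema changes. This is a minimal sketch using option names from Icehouse-era ML2; check them against your release:

```
# /etc/neutron/plugins/ml2/ml2_conf.ini (illustrative excerpt)
[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vxlan
# the mechanism driver can change while the ML2 schema stays put:
# mechanism_drivers = openvswitch
mechanism_drivers = linuxbridge
```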
At that point, we had a consistent database schema and we could more or less change the underlying plugin: update a few fields in the database, pull a few virtual interfaces out of their bridges, restart the agents, and all of a sudden your Linux bridge agent was building out the virtual networking infrastructure using Linux bridges. Where that's important is that, at the time, your only other alternative may have been to redeploy a new cloud leveraging Linux bridge. That's really painful for users, right? Nobody wants to have to snapshot instances and migrate them to a new cloud. Maybe the instability of the cloud also makes them shy away from OpenStack in general. Those are things we wanted to avoid. So where we went from there was making Linux bridge the standard driver for our RPC release based on Icehouse. And ever since the Icehouse release, Linux bridge continues to be our standard driver, even though OVS is still an option in upstream OpenStack-Ansible and will soon be adopted. As I mentioned, Open vSwitch and Neutron have continued to mature, so a lot of the issues that we had in the early days aren't really issues anymore. Now, some of the features that the Neutron community has released over the last couple of release cycles, things like Tap-as-a-Service, OVN, or Distributed Virtual Routing, those are features that you're only going to get with Open vSwitch. So if those features are important to you, then by all means, stick with that deployment strategy. But we focus on keeping things fairly simple and don't yet require those features, so Linux bridge is a good option for us.

The next example we're going to give is a change in the way that we performed our tenant network segmentation, meaning segmenting traffic for multiple tenants, or even different networks within the same tenant. The two options that we're going to look at today are VXLAN and VLAN: VXLAN being the standard overlay technology that allows tenant networks to be encapsulated in a shared network across the physical hosts, and standard VLAN being traditional 802.1Q tagging. So here are a couple of key points of each before we go into our transition from one to the next.

VXLAN started off as our standard for network segmentation. VXLAN is the overlay technology used between hosts; it runs over standard UDP, so it's familiar to most administrators. It allows for over 16 million unique segmentation IDs. So, 16 million VNIs, which in essence is unlimited to the user. The user can spin up, tear down, and create as many networks as they want; they've got 16 million of them. VLAN, however, is also an option. It's a little more traditional. You're limited to the standard 4096 segmentation IDs, and likely a lot fewer due to limitations of spanning tree on your physical infrastructure. So the 4096 is probably more like 500.

Again, VXLAN was our standard, so we deployed this cloud with standard VXLAN. And the first thing we noticed was that, due to the overhead of VXLAN, SSH stopped working. Actually, it never worked. We could ping the instance once it was spun up. We could telnet to port 22. But the SSH handshake failed every time. Our solution was simply to lower the MTU on the instance, and SSH worked. Now at that point, we've got an instance with a 1450 MTU, which is fine for the most part. The other consideration was to address it at the infrastructure level by raising the MTU across all the hosts, across all the NICs, across all the switch ports.
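By the way, the usual way to push that lower MTU to instances was through the DHCP agent. Here's a rough sketch of what that looks like with the stock dnsmasq-based agent; the paths and option names reflect Icehouse/Juno-era Neutron, so verify against your release:

```
# /etc/neutron/dhcp_agent.ini (illustrative excerpt)
[DEFAULT]
dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

# /etc/neutron/dnsmasq-neutron.conf
# DHCP option 26 sets the interface MTU; 1450 leaves room for the
# 50 bytes of VXLAN encapsulation on a standard 1500-byte network
dhcp-option-force=26,1450
```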
But we went with lowering the instance MTU. Do you want to go into L2 pop? Yeah, sure. So one of the other dependencies of using VXLAN as an overlay technology is the need for some sort of mechanism to program the forwarding tables on all of the hosts involved. For the community, there were two options, and we'll get into those in more detail later on, but our default option was to use the L2 population mechanism driver to allow Neutron to program all of that stuff dynamically across the hosts. What we found, especially in the early days, Juno was the, or sorry, Icehouse was the first release where the L2 pop driver was available, and even up through Liberty we've experienced issues: as a cloud scales in terms of networks and ports, specifically, you might find that the agents on nodes are slow to update their bridge forwarding tables and ARP tables. What that translates to is the inability for an instance to get its IP address from DHCP, or especially the inability for that instance to communicate with its gateway or other hosts in the environment.

How this manifests itself is a customer saying, hey, I can reach my instance, but I can't hit some other host on this network, or, I can console to my instance and I can hit that host, but I can't hit the internet. And what you'll find when you start troubleshooting is that the bridge forwarding table on your network node is missing the MAC address and the VTEP address, and maybe even the ARP entry, for that instance. Now, that's something that will eventually show up, but your first instinct is to think that something's wrong, you start restarting agents, and before you know it, things start compounding and you're into a few hours of troubleshooting. It also interferes with workflow. If you spin up an instance and it takes 10 minutes for the instance to get its IP address, that impacts production. Absolutely. So troubleshooting became an issue, because we have to make sure that the forwarding database and the ARP table are correct on both sides. Exactly. Yep.

Another issue was simply performance. This was a few years ago, and the NICs that we were using showed a significant decrease on a 10-gig network: down to roughly three and a half gigabits per second using VXLAN, compared to just about line speed, about 9.8 gigabits per second, using VLAN. Huge performance difference. We performed similar tests on different NICs, newer hardware that had VXLAN offloading enabled, and we got closer to VLAN. I'm thinking 9.2, 9.3, maybe 9.5. Yeah, close enough to make it worthwhile, right?

Where we find this common is when users or deployers repurpose existing hardware. While it makes sense to try and leverage that hardware to the fullest extent, some of the hardware in those devices is admittedly old, especially when it comes to network cards. The Intel X520 and X540 are very common cards out there in the wild. They're based on the 82599 chipset, which supports SR-IOV and some of these other networking technologies, but can't handle VXLAN offloading. And so that's where you start to see some of these performance issues.

So, taking into consideration troubleshooting, and this is where the fun begins, let me go to the next slide. As James mentioned, there needs to be some form of population for both the bridge forwarding database as well as the ARP table.
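To make the troubleshooting concrete, these are the kinds of checks we mean. A sketch; the interface names here are examples, and yours will differ:

```
# unicast entries map an instance MAC to its remote VTEP; if one is
# missing here, traffic to that instance gets dropped or flooded
bridge fdb show dev vxlan-15

# count the all-zeros "flooding" entries for the VNI (more on these
# shortly); expect roughly one per host participating in the network
bridge fdb show dev vxlan-15 | grep -c '00:00:00:00:00:00'

# the ARP entries that L2 pop pre-programs on the VXLAN interface
ip neighbor show dev vxlan-15
```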
We saw bugs, not necessarily with Neutron but with the Linux kernel, that made troubleshooting near impossible. We'll give one example here with the flooding entry. The flooding entry is akin to a broadcast: if an entry cannot be found for a particular MAC address, go ahead and send the traffic to everyone in the network, to any host that has a member of that particular VXLAN network. So we're supposed to see one flooding entry per host, per VXLAN network. In this case, we have VXLAN 15, and the flooding entry, with all zeros for its MAC address, pointing at a particular host. We're supposed to see one for that host. We were seeing over 28,000 of them. If we rebooted the host, we'd see the proper number, which is usually the same as the number of nodes in your infrastructure that require that entry; in this case, 55. We thought everything was good. We'd go to bed, come back to work the next morning, and that number was at 1.5 million. And it wasn't consistent, either, right? So we submitted a bug to upstream Neutron, provided a bunch of data, and come to find out it wasn't a Neutron agent issue; it was an actual kernel issue. Meanwhile, this impacts the Linux bridge agent up through a few releases, because the agents sometimes perform screen scraping after performing certain actions. If the action to list those 28,000 entries takes 15 seconds rather than half a second, that's going to impact the agent's operation, resulting in possible downtime for the user. The user doesn't care that there are 1.5 million entries and things take a lot longer. They care about the time to reach their instance.

So we had a decision to make. And the decision was to take them off of VXLAN and put them on standard VLAN networks without impacting their production, or with minimal impact to their production. Here's the process that we took. We know that there are 16 million possible VNIs with VXLAN. This customer obviously wasn't using 16 million, but they were using a fair amount, so we had to provision at least that many VLANs, plus enough for future growth. We tried to do a one-to-one alignment between the VXLANs that we had and the VLANs that we needed. We set the default tenant network type to VLAN, as opposed to VXLAN, so that as they create new networks, VLANs are added, not VXLAN networks. And then another database hack, which wasn't as difficult as you'd think, because if I can hack the database, any of y'all can. Modify the database, restart some of the agents, and watch the magic happen.

So here's an example of one of those networks. On the left, you can see a standard VXLAN network with VNI 99; notice the network ID ends in cb3. After the conversion, the same network is now a provider network type of VLAN with segmentation ID 1109. The best part about it: the instances had no idea that it happened. Yeah, and these may seem like drastic steps, right? Nobody wants to have to get into the Neutron database to make changes, restart agents, and develop that process. The reason we had to take those steps is that there is no API hook to perform this sort of modification. The idea is that you create a network, you use it, and if you don't need it, you destroy it, but you can't go from one provider network type to another. The ML2 driver and the database schema really enabled us to make this change quickly. And we're talking three seconds of downtime as the Linux bridge agent rebuilt those interfaces.
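For the curious, the database change for each network amounts to rewriting its segment record. This is a sketch against a roughly Kilo-era ML2 schema; table and column names vary by release, the physnet label is a placeholder, and you should test against a copy of your own database first:

```
-- convert one network from VXLAN (VNI 99) to a provider VLAN (1109)
UPDATE ml2_network_segments
   SET network_type = 'vlan',
       physical_network = 'physnet1',   -- placeholder physnet label
       segmentation_id = 1109
 WHERE network_id = '<network-uuid>';
```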
So moving forward, having learned the lessons and been burned by early VXLAN use: VLAN tenant networks may still have a place after all. I know the emphasis is cloud scale, and VXLAN is going to enable that, but some of our customers may only have five networks, or 50 networks. And sure, that may be a lot of VLANs to provision for the environment, but sometimes stability and consistency trump scalability. With VLAN tenant networks, there's a lot less work for the Neutron agents to do, because you no longer have to rely on the L2 pop mechanism driver to program all of the bridge tables. And you get better performance in the eyes of the user, right? Their instance is available much quicker, and it's consistently available. What you'll see moving forward, I think, is VXLAN moving away from the host altogether and being performed at the top-of-rack switch. At that point you're still looking at VLANs between the compute node and the top-of-rack switch. And it's really important, if you do want to leverage VXLAN, which is great, that you maybe make the investment in newer hardware that can offer better performance than some of the older-generation stuff.

So the next thing, Johnny? The next migration that we performed came down to a decision: do we use the L3 agent, or not use the L3 agent at all? Meaning, are we going to put our instances behind routers, or are we going to put them directly behind some type of physical network device? One is going to offer us flexibility; the other, perhaps, performance. The benefit of having the L3 agent is that it gives power to the user. Users can create routers, users can create floating IPs, and they can manipulate those floating IPs however they need to; you'll see a sketch of that self-service flow in a moment. Some other key points? Sorry. No, that's fine.

So when you leverage Neutron routers, you're putting network creation and architecture into the hands of the user. With VXLAN tenant networks, those networks are effectively isolated; they aren't being terminated on some physical network device. The Neutron router is where you would connect that VXLAN network to provide inbound and outbound connectivity. So if you go with VXLAN, you absolutely need to go with the L3 agent and Neutron routers. Multiple tenant networks can be attached to a router, and you're able to route east-west between networks behind the same router, no problem. Your inbound and outbound connectivity for an instance is typically handled by some sort of NAT on that Neutron router: source NAT for outbound traffic, or floating IPs providing a unique inbound and outbound address.

With Neutron routers, there's little to no change to the physical infrastructure. One of the benefits of overlay networking is that it all rides on a single network. Users can create thousands of networks and you don't have to modify the physical infrastructure at all. You get overlapping subnets, and you're able to leverage new technologies like VPN-as-a-service or firewall-as-a-service. When you talk about moving instances onto provider networks, which effectively use physical hardware gateways, now you're talking about having to interface with a network administrator to trunk VLANs to compute nodes, or use a mechanism driver like the Arista driver or Cisco driver to program those switches. The great part about going this route is that your compute node is now the path to the physical network.
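Here's that self-service flow we mentioned, as a user would drive it through the Neutron API. A sketch using the era-appropriate neutron client; everything in angle brackets is a placeholder:

```
# create a router and wire it between an external and a tenant network
neutron router-create demo-router
neutron router-gateway-set demo-router <external-net-id>
neutron router-interface-add demo-router <tenant-subnet-id>

# allocate a floating IP and map it to an instance's port (one-to-one NAT)
neutron floatingip-create <external-net-id>
neutron floatingip-associate <floatingip-id> <instance-port-id>
```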
With provider networks, you're no longer having to hit a network node, or an infra node where that Neutron router may live. A downside, potentially, is that if you do require some sort of NAT, it has to be handled upstream, and chances are there's no API interaction with Neutron to make that happen.

So, some of the issues that we ran into in this particular situation: obviously, the router became a single point of failure. We had some issues with HA routers at the time. This was again at least a year ago, when we were having issues with HA routers either both becoming active, or neither becoming active. Moving over to a hardware device allowed us to utilize the HA features of the hardware, which were possibly a little more reliable at the time than HA routers. And there was no network congestion, because as James mentioned, the compute node can send traffic directly to the hardware gateway. And of course, we noticed that if a router did fail over, or was rescheduled to a different agent, programming that router took 10, 15, 20 minutes depending on the number of floating IPs on that router, just to program the iptables rules required to make those translations. It could take minutes to restore connectivity.

So one of, and I apologize, this may be a little difficult to see, but one of the tests we performed was a scale test, loading up a single Neutron router with 1,000 VMs behind it and 1,000 floating IPs, and performing what we call a time-to-ping test: how long did it take for an instance to be accessible via its floating IP once I booted it? You'll see that across these 1,000 VMs, boot time was pretty consistent; each instance went active in six to ten seconds. But you'll see that the increase in time-to-ping is roughly linear as I increase the number of instances. What we learned was that as the number of ports associated with that router grew, the update message got larger and larger, and the actions the router had to perform when setting up the network took longer. When the Neutron router was applying iptables rules, it wasn't only applying the rule for that one instance; it was possibly reapplying all of the existing rules as part of its operation. A safety mechanism. Yeah, an auto-heal sort of functionality. One thing we'll mention is that drop in time there in the green box: that's the result of us finding a bug in the way the Neutron L3 agent handled the application of those iptables rules. When the upstream community patched the bug, we immediately saw a drop, and then the increase resumed as we scaled out. Chances are you're not going to have a single router with 1,000 VMs behind it, but you never know how your users are going to use the cloud, all right? We'll show a rough sketch of that time-to-ping harness in a second.

Some of the other issues we experienced with L3 are that requiring NAT for inbound or outbound connectivity sometimes had a negative impact on applications. We have a customer that uses Windows instances and relies on WMI; just like FTP, the real IP may be required for connectivity, and the use of NAT kind of broke their application. Active Directory, too. And sometimes you have use cases where you want users to be able to reach an instance via its fixed IP externally, without the use of NAT. To make that work, we had to statically route tenant networks to Neutron routers, and that doesn't scale well either, right? The idea is to leverage the Neutron API to do everything.
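Here's the time-to-ping idea, boiled down. This is a conceptual sketch, not our actual harness; the image, flavor, and network IDs are placeholders, and the floating IP association step is elided:

```
#!/usr/bin/env bash
# boot an instance, then measure seconds until its floating IP answers ping
start=$(date +%s)
nova boot --image <image-id> --flavor m1.small \
    --nic net-id=<tenant-net-id> "scale-test-$1"
fip=$(neutron floatingip-create <external-net-id> \
      -f value -c floating_ip_address)
# ... look up the instance's port and neutron floatingip-associate here ...
until ping -c1 -W1 "$fip" > /dev/null 2>&1; do sleep 1; done
echo "time-to-ping: $(( $(date +%s) - start ))s"
```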
When you start interfacing with external teams, you really take a lot of the flexibility of the cloud out of the equation. So again, another pretty big decision: do we take them off L3? Are they really using the features that L3 provides them? Or would it be simpler to put them on a standard provider network? We decided to ditch L3. They weren't using floating IPs the way they needed to, and they weren't really utilizing firewall-as-a-service, VPN-as-a-service, or any of those other options. So we decided to simplify the network. The process was pretty simple: detach the routers from the gateway, detach the routers from the tenant networks, and reconfigure the physical infrastructure to take over what the routers were doing.

So here's a quick side-by-side of what the tenant networks looked like before and after the switch. That's weird. Well, on the left, we have three Neutron routers connected to a single provider network, with the VMs behind those networks, and the transition was to basically move those tenant networks up to physical hardware. Again, at that point you lose a lot of the flexibility that Neutron provides you, but for this use case, the user was sticking to a single network behind a single router and didn't really scale out. So having gone with provider networks initially, gathering those requirements from the user early on, could have helped avoid those problems later. And we had minimal downtime for this migration. The instances had no idea that their gateway changed. A quick clearing of the ARP cache and everything was good. Yeah, we're good.

In the future, and even today in Mitaka, BGP speaker functionality enables us to advertise tenant networks to external gateways and not have to rely on static routing. DVR addresses the network bottleneck of the network node and allows compute nodes to forward traffic directly out. And L3 HA provides that router resiliency, right? Significant increases in stability for HA routers. Yep.

So last but not least, we'll talk real quickly about VXLAN VTEP learning and your two options: L2 population or multicast. As Johnny mentioned earlier, with VXLAN, every host across the cloud needs to know how to reach the others. Your two options, L2 pop and multicast, are both VTEP learning processes. L2 population is a more dynamic, I'm sorry, a very static programming process, because the Neutron agents and Neutron server work together to program the bridge table and the ARP table on the compute nodes. L2 pop was developed by the Neutron community and, at least as of Juno, has undergone a lot of work, but in some cases is still wildly inefficient. The alternative is multicast, which has been around for a long time. It's not managed by the Neutron agents; a lot of the learning is handled by the data plane itself, and it works through the OVS or VXLAN kernel modules, depending on which one you go with. One of the downsides is that it does require some programming on the physical switches to operate properly. As you can see in the multicast demonstration, as traffic leaves one VTEP it's sent to the multicast address, and the switch takes over and handles forwarding that traffic to the respective nodes. You don't see the animation, but it's there. Oh, that's a bummer. That's a bummer.
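Configuration-wise, the choice between the two shows up as a couple of agent options. A sketch using Linux bridge agent option names from roughly the Kilo-to-Mitaka era; the multicast group address is a placeholder, and remember the physical side needs IGMP support, as we'll get to:

```
# /etc/neutron/plugins/ml2/linuxbridge_agent.ini (illustrative excerpt)
[vxlan]
enable_vxlan = True

# option 1: Neutron programs FDB/ARP entries via the l2population driver
l2_population = True

# option 2: let the data plane learn via multicast instead
# (requires IGMP snooping and a querier on the physical network)
# l2_population = False
# vxlan_group = 239.1.1.1
```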
With L2 population, we also have an animation, and it basically shows that as a port is bound to a node, Neutron server sends notifications to the appropriate agents, and those agents are responsible for programming their own tables. As the number of ports in a network grows, the message for those updates gets larger, and we have observed that those updates in the agents take longer; it's very linear. Some of the issues with L2 pop, as we mentioned earlier, are that the agents are slow to build their bridge and ARP tables, and how that manifests itself, again, is the inability to reach an instance, or for an instance to reach its gateway. As a result of some of these issues with L2 pop, there have been numerous bugs submitted, and OpenStack-Ansible, the upstream community project that we leverage, changed the default from L2 pop to multicast.

Now, yeah, go ahead. Now we're running multicast, thinking this is the greatest thing since sliced bread, because we don't have to worry about L2 pop anymore. The first thing we realized, in a couple of our upgrades, was that the network infrastructure it was running on wasn't ready for multicast, which means everything came to a screeching halt. The physical network requires, first of all, hardware that supports multicast, and then there's some configuration involved: IGMP snooping and the IGMP querier. Both of those need to be configured for multicast to function properly. So again, when we did this upgrade, nothing worked. So we fixed it. Right, how did we fix it? Back to L2 pop. Back to L2 pop. And there's nothing wrong with that for this cloud. It wasn't going to see the same scaling issues that manifested in our earlier example; this is a much smaller cloud, a much smaller scale. So L2 pop was going to be just fine. It required no interaction from the network guys. We were able to leverage Neutron to do all the programming for us. Yep.

And again, harking back to it: this is not a one-size-fits-all solution, right? Gather your requirements, determine what the best method is going to be for you, and implement that, but know that in a lot of these cases, as we've shown, you can change, and it doesn't necessarily require a repave.

So we'll wrap this up. We've gone over a little bit, but to recap some of the lessons: if you're going to live on the bleeding edge and adopt a technology early in its infancy, be prepared for that to be painful, especially as an operator, and even more so as a user, right? A lot of these clouds are in production, thousands of these clouds are in production, and if you're leveraging technology in its infancy, you're going to cause problems for your users, and that may have a negative impact on their adoption of OpenStack in general. We learned that short-term pain is sometimes needed for long-term gain. I'm going to contradict what I just said, but our adoption of those technologies early on gave us the operational experience to know how we wanted to move forward. We feel that we understand Open vSwitch a lot better now. Had we adopted it only now, it's much more complex than it was even then, so it may have taken a lot longer for us to get on board. Be sure that your hardware meets your network requirements; upgrading to new NICs to leverage these technologies is probably the way to go, so you'll want to make your investments in hardware accordingly.
And users are going to lose faith in the product and your cloud, and maybe look elsewhere, to public clouds, or lose faith in OpenStack in general, if what you're providing doesn't offer reasonable access times, stability, and consistency, right? And we've shown that keeping things simple, when you don't need the complexity, is going to save operational time and pain and provide a better experience. Make the cloud more functional. Yeah.

Well, that's all we've got, guys. I'm not sure we have time for questions; we need to let the other group come up, but if you want to hit us up, we'll be up front. So. Could you bring up the second slide? Yeah, absolutely. Second slide, I think. All right, well, thank you, everyone. Thank you.