All right, so we are going to be talking about managing Open vSwitch in a large heterogeneous environment. I am not Andy Hill, that is not Joel Priest; use the slide and you can figure out the rest. We are both systems engineers at Rackspace who have been working on the public cloud for about three years now in total.

So, some quick definitions first. What do we mean by a large fleet? Tens of thousands of hosts, and obviously way more instances on top of that, which equates to lots of logical ports and everything else you are managing with Open vSwitch: hundreds of thousands of instances, to be exact. Heterogeneous: tons of different hardware manufacturers, more than is probably sane. We have several major XenServer versions, differing kernel versions within those, multiple hardware vendors for our hypervisors, and different hardware revisions within those. All of those tweak the requirements and behavior slightly, as with anything really, and by extension Open vSwitch. We have six production public clouds stretching from DFW, ORD, and IAD in the US to London, Hong Kong, Sydney, et cetera. Six internal private clouds, meaning we run OpenStack on OpenStack: that's our internal OpenStack deploy that we refer to as Inova, as well as all of our pre-production environments, CI/CD, dev, all that kind of stuff.

So this is a quick OVS introduction on the flow of Open vSwitch, for anyone not familiar with it. Essentially you potentially have a control and management cluster talking down to your OVS, which in turn has a local database with reads and writes going to and from it, both to the OVS process and to the controller. All of that exists in user space, and then the Open vSwitch kernel module exists below that. The main takeaway from this slide, as far as our presentation goes: user space, not as fast; kernel space, super fast. You want as much as possible handled down in kernel space to get the most performance out of Open vSwitch.

History. Rackspace has used OVS since, I don't know, it was probably just "v-switch" at the time. We have been on Open vSwitch for a very long time, pre-1.0 I believe. When we launched our next-gen OpenStack cloud we were running OVS 1.3 or 1.4, around there, and we were using it before that, before OpenStack even launched, towards the tail end of the old legacy cloud, the Slicehost days, if anybody remembers that. OVS powers 100% of the next-gen cloud, and we have upgraded it nine times in the last two years.

So this is my favorite slide. If you get nothing else from this talk: if you are a package maintainer, if you are using Open vSwitch, upgrade OVS. We are on very new OVS, as new as we can get for the most part, and we will go into excruciating detail on that in a moment. You want to be on newer versions of OVS. We have talked to people who have not gotten what they would consider expected performance out of it, and a lot of times when we dig deep on that, we find out they are running way-old versions. So please, please, please upgrade your OVS.

So why upgrade OVS in the first place? These are the summary of the reasons we have upgraded. The main ones are around performance; we also did some upgrades to make upgrading itself less painful.
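A quick way to see how much of your traffic is actually staying on the kernel fast path is the lookup counters that ovs-dpctl reports: a "missed" lookup means the packet had to be punted up to user space. Here is a minimal sketch of reading them, assuming a Linux system datapath and an ovs-dpctl whose `show` output includes a line like `lookups: hit:... missed:... lost:...`:

```python
#!/usr/bin/env python
# Rough sketch: estimate the kernel fast-path hit rate from ovs-dpctl output.
# Assumes `ovs-dpctl show` prints a line like:
#   lookups: hit:123456 missed:789 lost:0
import re
import subprocess

def datapath_hit_rate():
    out = subprocess.check_output(["ovs-dpctl", "show"]).decode()
    m = re.search(r"lookups: hit:(\d+) missed:(\d+) lost:(\d+)", out)
    if not m:
        raise RuntimeError("could not find lookup stats in ovs-dpctl output")
    hit, missed, lost = (int(g) for g in m.groups())
    total = hit + missed
    return hit / float(total) if total else 1.0

if __name__ == "__main__":
    print("fast-path hit rate: %.2f%%" % (100 * datapath_hit_rate()))
```

A falling hit rate is a hint that packets are bouncing up to ovs-vswitchd instead of staying in the kernel, which is exactly the slow-path behavior the talk keeps circling back to.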
The OVS developers have made a lot of improvements to how the in-place upgrade of OVS happens, which makes it more palatable for us to do as often as we have in the past two years. We also upgraded OVS because we needed to upgrade our NSX controllers. Anyone who has ever upgraded NSX knows that one component of that is upgrading Open vSwitch, so we had to do that a couple of times. There was also a really nasty regression in OVS 2.1 that made us upgrade once more: certain types of VMs would be okay the first time they were plugged, but if the tenant rebooted the instance, it would no longer get datapath flows. We have upgraded OVS for a few other reasons, but the main driving reason behind all of this is performance.

So, what can impact your performance? These are things we have seen impact performance for the public cloud use case, largely around broadcast domain sizing, with specific attention to broadcast-related flows. Take some real care and make sure those behave the way you want them to if you are crafting these flows. A lot of times what ends up happening is that the broadcast traffic goes to every node and every VIF, and instead of OVS dropping the flows it does not need, the traffic goes all the way to each VIF anyway, so you spend CPU cycles considering traffic that is never going to be relevant to your host.

We monitor OVS quite a bit. This is a big chart of ovs-vswitchd CPU utilization. This was in a cell that had one of those large broadcast domains I am talking about, and then somebody started doing something a little interesting, and we see this: OVS CPU went through the roof because of all the broadcast traffic going to all the nodes. Somebody was basically port scanning, generating a lot of the kind of traffic that ends up being particularly painful for OVS. This was in the olden days (and that is a picture of Chad, one of our systems engineers), and it was just a really, really bad situation. We had to take a look at what we may have done as an operator and examine the flows themselves, but vSwitch could have handled this bad situation better anyway.

So around performance, there are three eras that we have experienced. Pre-1.11, which I will call the dark ages, is really rough: you do not get performance on a lot of workloads, and even typical workloads sometimes have problems. Then megaflows, and then what I will call ludicrous speed. I will get into each one of these.

Let's start with the dark ages. How many people in here have seen this flow-eviction-threshold setting and had to tune it because some workload was unhappy? By default, pre-1.11, the flow eviction threshold can cause problems. The flow eviction threshold is the point at which OVS starts evicting unused datapath flows. Joel mentioned earlier that the datapath flows are the fast path, but if OVS is managing them and trying to keep them to a minimum, you can hit situations where lots of datapath flows get generated and OVS spends CPU cycles churning through them over and over again, driving CPU usage way up. It was also single-threaded; that did not change until a later version of OVS.
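If you are stuck on a pre-2.1 release, the threshold lives in the bridge's other-config column and can be raised with ovs-vsctl. A hedged sketch follows; the bridge name and the value are purely illustrative, and the right number depends entirely on your workload:

```python
#!/usr/bin/env python
# Sketch: raise the flow eviction threshold on an older (pre-2.1) OVS,
# where the low default datapath flow count churns under busy workloads.
# Bridge name and threshold below are examples, not recommendations.
import subprocess

def set_flow_eviction_threshold(bridge, threshold):
    subprocess.check_call([
        "ovs-vsctl", "set", "bridge", bridge,
        "other-config:flow-eviction-threshold=%d" % threshold,
    ])

set_flow_eviction_threshold("br-int", 10000)
```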
Pre-1.11 OVS also did very specific matching on the traffic coming in, essentially an exact match on the full tuple: source port, destination port, all the little details of a connection. So consider workloads where a client makes a brand-new connection to a server and gets a brand-new source port on every single connection. Well, that is another flow, another thing that OVS has to keep up with. That is one of the situations where OVS really had a tough time. And there were a couple of other things a little lower down: the more bridges you had, and the more bridges traffic had to traverse, the more of a penalty you paid for that.

So then OVS 1.11 happened, and we got megaflows, and we felt really good about it. There had been a whole lot of activity in the community saying: just wait, megaflows are really going to help you out. And really, for the most part, you were far less likely to hit that flow eviction threshold I mentioned earlier, because instead of doing a very specific exact match, OVS could wildcard fields in those flows and therefore keep the number of flows in the datapath much smaller, so performance was much better. But there were still some workloads out there; 2,000 datapath flows for some workloads is still pretty much nothing, it's trivial. So we still had some issues, but we saw really significant improvements from pre-1.11 to 1.11. This is a chart of the average datapath flows for one of our regions. You can see them hovering up there, and the dip on the right is post-1.11. In almost all of our cases the count was cut in half. We were very, very happy with this, but we still had cases where it just was not good enough.

So then we have ludicrous speed, if you were watching Spaceballs in the lunch area earlier. They did away with the flow eviction threshold completely in 2.1 and beyond. Now, instead of a roughly 2,000 datapath flow limit, there are 200,000 datapath flows by default out of the gate with OVS 2.1 and beyond, and it is a configurable value. We can run all the tests we want, lots of synthetic workloads, yada yada, but in the wild, for what we see on the public cloud, we have seen over 72,000 datapath flows on some workloads at 260,000 packets per second. That is a significant improvement over pre-2.1, and we have been really, really happy with it. This is a chart of the top five datapath flow consumers post-2.1, just to give you an idea: most workloads are down towards the bottom, but we have got one guy who is kind of out of control in the 40,000 datapath flow range.

This is a Smokeping chart, if any of you are familiar with Smokeping. The non-green colors on the chart indicate jitter in the connection, and it is pretty clear where we upgraded this environment from 1.4 to 2.1; the performance improvement was very significant. This was for an internal cloud, and our infrastructure team loves it. We have much more consistent performance on the network because of this upgrade.

So, mission accomplished, right? We moved the bottleneck. From what we have seen on the networking side, OVS is not our bottleneck anymore.
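On 2.1 and later, the tunable that replaces the eviction threshold is the global datapath flow limit, which sits in the Open_vSwitch table rather than on a bridge. A small sketch of reading the live flow count and bumping the limit; the 400,000 figure is purely illustrative:

```python
#!/usr/bin/env python
# Sketch for OVS 2.1+: count current datapath flows and raise the global
# flow limit (default 200,000). The new value below is just an example.
import subprocess

def datapath_flow_count():
    out = subprocess.check_output(["ovs-dpctl", "dump-flows"]).decode()
    return len([line for line in out.splitlines() if line.strip()])

def set_flow_limit(limit):
    subprocess.check_call([
        "ovs-vsctl", "set", "Open_vSwitch", ".",
        "other-config:flow-limit=%d" % limit,
    ])

print("current datapath flows: %d" % datapath_flow_count())
set_flow_limit(400000)
```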
So the new bottleneck we have for tenant workloads is more around the Xen netback/netfront drivers, which have some significant improvements coming soon, and then guest OS kernel tuning: some people have very specific workloads that need a bit of kernel tuning. Those are really the things we are seeing now. Just to give you an idea: after we went to 2.1, we have seen no escalations, no problem cases where we could go back and say OVS was the problem. We completed all the upgrades for production in July, I think, and we have not had a single case since that we have been able to trace all the way back to OVS. That was a huge deal for us.

So I am going to bang the drum a little bit more here. Please upgrade. I hope there are some people who distribute OVS in here; if you do, please upgrade so that your customers who are using packaged OVS from you can really take advantage of this. Please upgrade, please. And by the way, OVS 2.3 is long-term support, if there are concerns around that.

All right, so now that we have talked about upgrades and the benefits of upgrades, how do you do it? Well, we orchestrated the whole process with Ansible. One of the things that is kind of unique about upgrading Open vSwitch is that you are losing the network connection to your host, so there can be a little bit of concern around this, especially if you are doing a large number of hosts at a single time. So take a look at Ansible's async support if you are going to attempt this with Ansible, and watch your SSH timeouts in particular. From experience in a pre-production environment: I may or may not have taken down an entire cabinet of machines because I was not paying close attention to that. Around the impact of OVS upgrades: for bonded configurations we see under 30 seconds of data plane impact, so we do interrupt our customers, but it is a brief interruption, and for our non-bonded hosts it is under five seconds. There is still some discussion and fixes going on around why the bonded configurations take longer, and there is a link in the slides to that, I think.

Yeah, so all you really have to do after that is get the RPMs to your host, execute a force-reload-kmod, and you are all done, right? It is not quite that simple. When we did our first upgrade, we had something really strange happen. After the upgrade, full hypervisors were just rebooting, kind of at random, in one environment, and this was pretty concerning. It comes back to the OVS bridge fail mode. By default, the bridge fail mode is standalone; you can set it to secure. The difference between the two: in standalone mode, whenever traffic hits the host, the bridge acts as a normal L2 switch; in secure mode, it drops the traffic. Some really interesting scenarios can happen because of standalone mode, one of which was a XenServer bug that we found where traffic got broadcast to all of the VIFs. We were sharing a network across iSCSI and the provider network, and the iSCSI traffic hit a VIF. That should not have happened, and it triggered a XenServer bug that caused a hypervisor page fault and reboot. So it is really, really tough. And you cannot just say, I want to switch from standalone to secure fail mode. It is not quite that simple.
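For reference, here is roughly what auditing fail modes across a host's bridges looks like with ovs-vsctl; a minimal sketch, where an empty result means the default standalone behavior. Deciding what to do about a non-secure bridge is the hard, data-plane-impacting part, as you are about to see:

```python
#!/usr/bin/env python
# Sketch: report the fail mode of every OVS bridge on a host.
# An empty fail mode means the default, standalone, behavior.
import subprocess

def bridge_fail_modes():
    bridges = subprocess.check_output(["ovs-vsctl", "list-br"]).decode().split()
    modes = {}
    for br in bridges:
        mode = subprocess.check_output(
            ["ovs-vsctl", "get-fail-mode", br]).decode().strip()
        modes[br] = mode or "standalone (default)"
    return modes

for bridge, mode in sorted(bridge_fail_modes().items()):
    print("%s: %s" % (bridge, mode))
```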
If you do flip a bridge from standalone to secure, it is a data-plane-impacting event: all the flows on the bridge get cleared. So that is another wrinkle in this whole process. And then yet another one: the fail modes do not always persist like you would think they would. You have to change a setting in XenServer anyway to have the fail mode persist across reboot. Oh, and some more issues around fail modes. We use patch ports for moving our traffic along to the appropriate bridge it needs to go to, but if you have misconfigured patch ports combined with a standalone fail mode, you can get a routing loop; that was an incident we had. So around this whole theme, we really had to secure all of our bridges. We had to go to the secure fail mode; otherwise we could not upgrade OVS across the rest of the fleet. And the patch ports do not persist across reboot, and there is no construct within XenServer to make them persist, so we had to finagle a cron task to recreate them on reboot.

So now that we had determined we had to migrate from the standalone fail mode to the secure fail mode, here are the high-level steps we took to upgrade. We created a new bridge with a configuration similar to the old one that all the VIFs were plugged into. Then we moved each VIF to the new bridge, which got them their new flows; this was a loss of just a few packets. After that we actually performed the OVS upgrade itself. Then we had to do that ensure-the-bridge-fail-mode thing on reboot, and then we cleaned everything up. We did have some monitoring flip out because interfaces changed according to the kernel whenever we reloaded the OVS kmod, so we have to restart SNMP just so our monitoring system does not lose its mind. And we did this entire process with Ansible. It was a wonderful tool for this kind of step-by-step process, and I cannot really think of another tool I would use to accomplish such a thing.

Okay, so there is yet another gotcha around upgrading OVS, and it is around kernel modules. This is not a hundred percent of the time, but we do not risk it. Basically, if you upgrade your operating system's kernel through normal patching or whatever, you also need to make sure you have a matching OVS kernel module to go with it. There are situations, especially in these big upgrades where you go from, say, 1.4 to 2.1, where if you do not have the matching OVS kernel module for your new kernel, you are not going to have networking on the hypervisor. This is another one of those things we learned the hard way. The way we handle it: we do not upgrade the operating system kernel without upgrading the OVS kernel module, and we do not upgrade OVS without ensuring we are not missing a part of that equation. So yeah, kernel upgrade equals OVS upgrade is basically what we have taken away from it, and since we have such a wildly varying environment, this means a lot more complexity to manage in terms of getting those kernel modules built and delivered to the right places. And if you do not pay close attention to this, and to your SSH timeouts like I mentioned earlier, you can end up with a trip to the out-of-band management system to try to get networking back on that hypervisor.
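A preflight check along those lines is easy to script: compare the running kernel against the vermagic that the openvswitch module on disk was built for, before you reboot anything. A sketch, assuming an out-of-tree module that modinfo can find for the running kernel:

```python
#!/usr/bin/env python
# Sketch: preflight check that the openvswitch kernel module on disk
# matches the running kernel, so the host still has networking after
# a reboot. Assumes an out-of-tree module that modinfo can locate.
import subprocess

def running_kernel():
    return subprocess.check_output(["uname", "-r"]).decode().strip()

def module_vermagic(module="openvswitch"):
    out = subprocess.check_output(["modinfo", module]).decode()
    for line in out.splitlines():
        if line.startswith("vermagic:"):
            # vermagic looks like: "3.10.0-123.el7.x86_64 SMP mod_unload ..."
            return line.split()[1]
    raise RuntimeError("no vermagic found for module %s" % module)

kernel, vermagic = running_kernel(), module_vermagic()
if kernel != vermagic:
    raise SystemExit("MISMATCH: kernel %s vs module %s - do not reboot" %
                     (kernel, vermagic))
print("ok: openvswitch module matches kernel %s" % kernel)
```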
Okay, and there are still some other challenges around OVS and upgrading. There are a lot of reasons, maybe it is whoever packaged OVS for you, maybe it is an organizational thing, that you cannot upgrade OVS right now. This is a problem a lot of people are facing, and we have realized that not everyone can just put the gas pedal down and upgrade OVS as frequently as we do. Other people within Rackspace also had issues with VLAN splinters, the OVS VLAN bug workaround. That is not really a problem with OVS so much as it is a problem with some NICs, but it is really well documented in the OVS project itself. And I mentioned it earlier, but a real thorn in my side was that some components of OVS were not tightly integrated with the hypervisor; we really wanted those patch ports to just be there on reboot instead of having to hack something up via cron. And just a quick summary of the platforms we have run OVS on: we have OVS managing LXC, KVM, and several versions of XenServer.

All right, so now that Andy has talked to you about all the really difficult stuff he has to deal with, I am going to talk to you about measuring and monitoring OVS. Essentially he is doing all the heavy lifting, and I am doing the part that goes: hey, Andy, it's broke, do something about it. Before I get into the measuring part, just out of curiosity, how many people in here are actually using OVS right now? How many of you are in the dark ages, pre-1.11? Oh, thank you. And 1.11, at least 1.11? Two-plus, at least 2.x? Yay. So I am glad we are preaching to the choir; half of our presentation was just begging you to upgrade when you are already there, so that is good.

So, measuring OVS. All the Graphite slides you saw earlier: that data was generated from a script that Andy and one of our other co-workers, Jason Kolker, worked on, called pavlovs (Pavlov's, get it?). It is a pretty straightforward Python script that takes data on packet counts, CPU utilization, and all that, opens a socket to Graphite, dumps it in there, and then you can aggregate it there. This is awesome. It gives you a really easy way to look at your entire fleet, see what your OVS is doing, and see things like we showed you earlier, where you can see that dramatic drop in flow counts and CPU utilization with every upgrade. So if you are not doing something like this, I would recommend it, especially if you are one of the people still in the dark ages or on 1.11 and thinking about coming up: do it before the upgrade if you are not already. It is just super fun to get literally instant gratification the second you finish reloading the kmod: oh hey, look, the network is just screaming now. So that is awesome.
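The Graphite side of a script like that is tiny, since Graphite's plaintext protocol is just "metric value timestamp" over TCP to port 2003. Here is a minimal sketch in that spirit, not the actual pavlovs code; the Graphite host and metric path are hypothetical placeholders:

```python
#!/usr/bin/env python
# Minimal sketch of shipping an OVS metric to Graphite's plaintext
# listener (TCP 2003). Host and metric path here are hypothetical.
import socket
import subprocess
import time

GRAPHITE_HOST = "graphite.example.com"  # placeholder
GRAPHITE_PORT = 2003

def datapath_flow_count():
    out = subprocess.check_output(["ovs-dpctl", "dump-flows"]).decode()
    return len([l for l in out.splitlines() if l.strip()])

def send_metric(path, value):
    msg = "%s %s %d\n" % (path, value, int(time.time()))
    sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT))
    try:
        sock.sendall(msg.encode())
    finally:
        sock.close()

host = socket.gethostname().replace(".", "_")
send_metric("ovs.%s.datapath_flows" % host, datapath_flow_count())
```

Run something like this out of cron on every hypervisor and Graphite does the aggregation, which is what makes the fleet-wide before-and-after charts possible.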
We aggregate these metrics, and we use cells, so we aggregate from the region to the cell to the hypervisor, and you can get whatever granular view you are going for. It is also really useful for DDoS protection. If you are managing hypervisors like we are: we have a dedicated backbone team monitoring that stuff at the ingress, but it is really useful to be able to see it from our side anyway. You just look at who your top users are, and if somebody all of a sudden spikes to orders of magnitude more than other customers, there is a good chance they are a DDoS target. That being said, when you are at a scale as large as ours, we are having problems scaling Graphite and statsd to accommodate this, because it is a ton of traffic: multiple data points for every single port on every single hypervisor, for every single instance. So we are running into some interesting challenges with that in some of our larger regions.

And then OVS and the compute host lifecycle. This is another script we have internally; basically, it is what runs when we provision a new hypervisor and checks it into our NVP/NSX controller. It has evolved a little bit over time. Back in the dark-ages era of things, it was just a standalone Python script that we were executing, and it was kind of dumb; now it is pretty much strictly Ansible. It checks the host in, and this is almost on the monitoring side of things, because for the most part you are only going to check something in once. If it fails, there are really only two reasons it should fail. Either that hypervisor is having a communication issue with your control cluster, which is bad, and you need to address what is breaking the communication between those two. Or, say, you are re-kicking a hypervisor: it had a hardware issue, you failed it out, replaced the hardware, and now you are re-bootstrapping it fresh. It has a new host certificate, and it gets a conflict when it checks into your controller, especially if you are using a consistent node name and management IP in the NSX controller. So you have to catch that error condition and work around it, either updating the host certificate separately or deleting the node and re-adding it from scratch.

So, what are the things we monitor? These are the big ones. We use Nagios for our monitoring, and most of these are passive SNMP checks. The first one is connectivity to the controller, a simple ovs-vsctl query: do I have any controllers that I am not connected to? The answer should always be no. For us, the main cause of that happening after something has been successfully deployed is generally routes. We are connecting to the controller via an overlay network, so we have to have routes directing that traffic to the controllers correctly, and likewise from the controllers back. So if something happens that knocks those routes out of whack, we need to address it. Next, the SDN integration process. For us, running XenServer, this is the ovs-xapi-sync process. The TL;DR on that is that it is the process reporting up to the controller the status of the different OVS parts on the hypervisor, so the controller knows where to plug in the little bits it is in charge of. There are some interesting challenges for us specifically on that, because we are using a forked version of it to handle some internal Rackspace Byzantine logic on our SDN side, so we are not using the stock one. So whenever we do
these upgrades for OVS, we have to check and make sure that our custom version was not stomped by the upgrade. So part of our upgrade process is to write our correct one back in and make sure that is the one running. Our check is just a simple bash script checking: is this process running, is it symlinked to ours, and does the file the symlink points to actually exist, to make sure it has got all the little bits we are expecting. And then routes. That goes back not just to talking to the controllers, but to routes from cell to cell, so that hypervisors can talk to other hypervisors. Look at our bridged networking setup here. The blue line is the direct path between two hypervisors; that is the one traffic takes the majority of the time. But on your first connection you go down through the control cluster node, the orange line, which establishes that tunnel at first. So if you break the blue line or that orange line, you are going to have a bad time: you have broken your tenant's isolated network, or your OVS connectivity between those two nodes. That is why we monitor for routes. And we are starting to do more than just monitoring now. We have some Rackspace projects that have been brought up in a few of the Rackspace talks here, Auditor and Resolver, which are in the same vein as the Entropy project, and I think we are having some discussions with those people about not duplicating work there. Essentially, we have checks in place so that if a host, for whatever reason, alerts for not having the correct routes it needs, Resolver can go run a job on it: run the Ansible task we have that fixes all the routes and restores that connectivity. Hopefully that never comes up, ideally, but when it does, we have got tasks to react to it. One of the other reasons this is an issue for us is that we have a brownfield IPv4 deployment, so we do not really have the luxury of saying all the tunnels are going to be established on one very large network. We have to do careful right-sizing and pay close attention to our IPv4 deployment. It would be really nice if our tunnel endpoints could just be v6.
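As a rough illustration of that kind of sync-process health check, here is the same idea sketched in Python rather than bash. The process name matches the stock XenServer integration, but the symlink and target paths are hypothetical stand-ins, since ours point at an internal fork:

```python
#!/usr/bin/env python
# Sketch of a sync-process health check: is the process up, is it the
# symlinked custom build we expect, and does the target exist?
# The fork location and symlink path are hypothetical placeholders.
import os
import subprocess

PROC_NAME = "ovs-xapi-sync"                      # process to look for
EXPECTED = "/opt/rackspace/bin/ovs-xapi-sync"    # hypothetical fork location
INSTALLED = "/usr/share/openvswitch/scripts/ovs-xapi-sync"  # assumed symlink

def process_running(name):
    # pgrep exits non-zero when nothing matches
    return subprocess.call(["pgrep", "-f", name]) == 0

def symlink_ok(link, expected):
    return (os.path.islink(link)
            and os.path.realpath(link) == expected
            and os.path.exists(expected))

if process_running(PROC_NAME) and symlink_ok(INSTALLED, EXPECTED):
    print("OK: %s running and pointing at our build" % PROC_NAME)
else:
    raise SystemExit("CRITICAL: %s check failed" % PROC_NAME)
```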
Yeah, and reboots. As Andy mentioned with the mismatched kernel versions: when you reboot, ideally the host comes back up and still has networking. I do not think I should have to explain that too much, but that is the goal we are going for. And if you follow the link, it goes to a blog post from Andy with a little more detail about the kernel mismatches and such. You would think this would never be a big, widespread thing; it is not like anything is ever going to happen that makes us reboot the entire... oh god. If anybody in here is a Rackspace customer, you got this email pretty recently. This was from the reboot apocalypse, as I tried to get people to call it, but I do not think anybody joined in other than me. This was the Xen bug that went around; it is why Amazon had to reboot a non-trivial portion of their cloud, and we had to reboot the entirety of our next-gen cloud. So yeah, that is a big deal. If you are rebooting literally every hypervisor in your fleet, which again is tens of thousands of hypervisors, you want them all to have networking when they come back up. We were monitoring for the kernel mismatches already, so we would have known about any host where that could potentially have been a problem, but when we did our orchestration for all of that (again with Ansible, if you cannot tell we are big Ansible fans), we put in pre-flight checks. It would hit every hypervisor and double-check: hey, do I have the kernel versions I am going to need? If not, we failed that host out of the reboot playbook immediately, so there would be no chance it would get rebooted, and then we would revisit it and rectify whatever was causing the problem. So I think that pretty much covers it. Anybody have any questions?

Q: The default XenServer passes all traffic, including the management and storage connections of the host, through Open vSwitch bridges. Do you use that bridge in your installations?

A: Can you say that one more time, please?

Q: The default XenServer passes all traffic, including management and storage connections, through an OVS bridge. Do you do this?

A: So, not the same bridge, but yes, we use OVS bridges for all of the networks.

Q: So in the dark ages, the most fun part was when everything died, including the management interface. There was a lot of fun back then.

A: Yes, the issue we mentioned with the kernel version mismatch would, yes, break the bridge that was our management interface. That is why we would get hit by that: because OVS was managing that network. That is why we had to go out-of-band to fix it, and why it was so critical for us to monitor for it aggressively. If we had a 1% failure on those reboots across tens of thousands of nodes, that is hundreds of out-of-band Java console, serial-over-LAN console sessions we would have to log into and fix. We had no interest in doing that.

Q: From your presentation, I gather you are using the out-of-tree OVS module. Since which kernel version would you recommend using the in-tree OVS module, or do you recommend that at all?
A: We do not use the in-tree kernel modules. The only kind of out-of-band things we do are around the ovs-xapi-sync process, and the kernel modules we use are compiled against the kernels that ship with XenServer.

Q: So the kernel shipped with XenServer does not include the OVS module? Because recent mainline kernel versions include OVS.

A: This is actually one of the problems we have with the packaging of XenServer itself: the patches you download for XenServer package OVS 1.4, and we are at a substantially higher version than that, so we cannot apply some of those patches because of the dependency there. So for XenServer, we compile the OVS 2.1 kernel module against the 2.6-whatever XenServer kernel.

Q: Oh okay, so you are running quite an old kernel version.

A: Sorry?

Q: You are running quite an old...

A: The one that we get with XenServer is... yeah.

Q: Hey, so last week there was a news release about Rackspace dropping Open vSwitch support in favor of Linux bridge for their private cloud efforts. Could you comment on that?

A: Yeah, I think it comes down to this: we are doing tenant networks, and for a lot of those build-outs they are not using tenant networks, so they do not need an SDN controller to manage their networking, really. Some of it is also around the additional complexity we have to take on in order to accomplish some of our tenant networking, some of the provider security we have in place, and some of the value-adds we do with our hybrid cloud offering. So we are using Open vSwitch for the public cloud, and I do not really see that going away.

Q: Okay, thanks.

A: Anybody else? Nope. All right. Thanks for coming, guys.