Excellent. So hi, everybody. My name is Rob Sherwood, and I'm going to be talking today about how to scale out Neutron networks. How many people here actually have an OpenStack network deployed? Anyone? And of you, how many are actually having problems with Neutron? So what I'm going to talk about today is a whole bunch of stuff that will hopefully help you with some of those Neutron problems. I'm the CTO at Big Switch Networks. We build networking software, but networking software is only one piece of this crazy OpenStack puzzle. So what I'm going to talk about today is a large scale-out test bed that we built with our partners: Mirantis on the OpenStack side, and Dell for both the compute and the physical hardware, running the Big Switch software. We scaled that out to 300 nodes running Neutron, and we made it work. So today I'm going to talk about what that scale-out test bed looked like, some of the things we ran into, and some of the fixes that we have. And I'm going to end with what I'm calling our Chaos Monkey torture test, to convince you that this really does work in a production environment. Honestly, there aren't enough of you in the room that I should be worried, so please interrupt me and ask questions as we go.

Just briefly about Big Switch: what we have seen is that the really large companies in the world, the Microsofts and Googles and Facebooks, have simply stopped using traditional networking vendors. They have started building their own boxes and their own networking stacks, and doing things completely differently from how they have always been done. So what Big Switch Networks does as a company is try to figure out what these guys are doing. Some of what they do isn't all that useful for the rest of the world, but a good chunk of it is, and we try to productize that and bring it to folks who can't afford to hire 10,000 programmers to run their network.

All of our products are based on this "one big switch" architecture. The idea is that if you look under the covers of a traditional chassis-based switch, something like a Nexus 7K or a Cat 6500, there are really three components in there: a supervisor card, which looks roughly like a PC; the line cards; and a set of fabric backplanes that the line cards plug into. Notably, those line cards plug into the backplane using what's called a Clos topology. What Big Switch Networks does is pull that supervisor card out of the box and call it a controller, pull the fabric cards out and call those spine switches, and pull the line cards out and call those leaf switches. So what we've done here is recreate all of the elements of this traditional big chassis switch using commodity, low-cost, 1U pizza-box data center switches. The reason people don't normally run their networks this way is that it's a pain to manage a whole bunch of individual switches. With this redundant pair of controllers, you get a single point of control that manages this big collection of switches, potentially upwards of 40 of them, as if it were one big switch.
And this one big switch notion is the core of how we build our products. It's so fundamental that it's how we named the company: Big Switch Networks. Does this make sense so far? Someone say yes so I can move on. Thank you. But I'm not going to move on; I've tricked you. There's a name we've given this to help you remember the vision. I claim this is the same thing that happened in the mainframe days, when people sold these large, vertically integrated mainframes, and eventually we moved to the data centers we have today, where you have disaggregated hardware, individual servers that you manage as if they were one big mainframe. Exactly the same shift is going on in the networking world right now. So we've jokingly started calling these big, older-style chassis boxes "netframes," just to drive that point home a little more.

This is the last slide about Big Switch, and then I'll get back to OpenStack. We have two products. Both basically take a large traditional box, either a network packet broker or a data center switch, and re-implement it with 1U pizza-box data center switches and smart software on top. That software uses SDN control technology, but that's honestly more of an implementation detail. The benefit you get as a user is that you can manage your entire network from a single pane of glass. With our data center switch equivalent, Big Cloud Fabric, our main use case, not surprisingly because we're here, is actually OpenStack. So we provide OpenStack plugins to manage our network. We have a Neutron plugin, and some of our developers are here today. We also have VMware integration, and we integrate with other forms of control software you might use, including Hadoop, Hyper-V, and a few others. But for purposes of today I'm going to focus on the OpenStack bits.

So what happens with Big Cloud Fabric is we deploy a collection of leaf and spine switches along with, and this is the new thing we're actually announcing today in Japan, our virtual switch as well. The idea is that you have both the physical switches and the virtual switches under the same control, from the same redundant pair of controllers. What this means in an OpenStack context is that we can improve the efficiency of things like the L3 agent, and I'll be talking about this. We can implement things like distributed virtual routing across the physical switches as well as the virtual ones. That means you get the equivalent of distributed routing for your virtual workloads, your physical workloads, and any combination of the two. We have our own Neutron plugin; it's available through the OpenStack distribution, and we have integration with folks like Red Hat and Mirantis, with Canonical soon to come. The one thing I really want you to take away from this picture is that, to the best of my knowledge, we're the only company that provides this integrated support of the physical and the virtual nodes, particularly for OpenStack.

So that's a whole bunch of stuff about the company that doesn't particularly apply to the fun part of the talk. The fun part is talking about how we, in partnership with Dell and Mirantis, actually built a huge test bed.
So we threw roughly a million dollars of hardware at this problem to figure out what the scaling limits for Neutron really are. We had 300 compute nodes donated by Dell, along with a whole bunch of switches, as you can see here: all told, just south of five terabytes of data and 140 terabytes of disk. This is actually a pretty sizable cluster. It was based on Dell R22 devices, we used Fuel as the installer, and we ran with five OpenStack control nodes and two redundant Big Cloud Fabric controller cluster nodes. Any questions so far? If nobody asks any questions, I'm going to get through this talk really fast.

So this actually used the full Neutron plugin. The question was whether this used the ML2 plugin. Because we have control of both the physical nodes and the virtual nodes, we actually don't need the ML2 split. This is correct, right? We've got our developers in the audience here. Yeah, and what Rajneesh was clarifying is that using our vSwitch is optional, so if you want to use just the physical fabric, we do have an ML2 plugin for that. But specific to this deployment, this is the full Neutron plugin.

This is a shot from our controller dashboard. We used 16, I'm sorry, 18 leaf switches: nine racks with redundant leaves in each rack, plus four spine switches. In our parlance a tenant is an OpenStack project, and we created 250-some projects and almost 900 endpoints.

So this is kind of the money slide: the problems we ran into and how we got them fixed. One of the first problems, and I think most people know this, is that your L3 agent can be scheduled somewhat arbitrarily across the fabric. So if you have two nodes in different networks that are trying to talk to each other, they may have to go somewhere completely different to route through an L3 agent before they can talk, even if they're actually co-located on the same device. If you have VM A in one network and VM B in another network, connected by a logical router, in traditional OpenStack networking that L3 agent could be somewhere else entirely, and you get hairpinning. The thing that's particularly bad about hairpinning, beyond the fact that maybe you can over-provision and run lots and lots of these L3 agents, is that it also causes these weird correlated error conditions. If a rack that's not involved with your compute nodes goes down, maybe it takes your L3 agent with it, and then you have these strange failures that seem to come out of nowhere.

Another thing people tend to do with Neutron is say, all right, if my L3 agents are a bottleneck, let me just make lots of L3 agents, and the extreme of that is an L3 agent per compute node. As you approach an L3 agent per compute node, you run into a whole bunch of scaling challenges. The first thing that happens is that all the L3 agents run a keepalive protocol, and that keepalive protocol doesn't scale once you get up to the order of 300 nodes. As a result, L3 agents start furiously timing out, and when they time out, the scheduler thinks they've died and tries to reschedule them. And the rescheduling process is itself actually fairly heavy.
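To make that last failure mode concrete, here's a tiny back-of-the-envelope simulation. This is my own illustration, not Neutron code and not something from the talk, and every constant and the congestion assumption are made up: heartbeats start getting lost as the agent count grows, agents get declared dead, and their routers get rescheduled, which is exactly the churn described above.

```python
# Toy model (not Neutron code) of the heartbeat problem described above:
# each L3 agent reports in periodically; if the server misses reports for
# too long it declares the agent dead and reschedules its routers, which
# itself adds load and causes more missed reports.
import random

REPORT_INTERVAL = 30   # seconds between agent heartbeats (illustrative value)
AGENT_DOWN_TIME = 75   # server declares an agent dead after this long (illustrative)
SIM_SECONDS = 600      # how long to simulate

def simulate(num_agents: int, base_drop: float = 0.002) -> int:
    """Return how many router reschedules happen in SIM_SECONDS.

    The per-heartbeat drop probability grows with the number of agents to
    mimic control-plane congestion (a deliberately crude assumption).
    """
    drop_prob = min(0.9, base_drop * num_agents)
    last_seen = {agent: 0 for agent in range(num_agents)}
    reschedules = 0
    for t in range(0, SIM_SECONDS, REPORT_INTERVAL):
        for agent in range(num_agents):
            if random.random() > drop_prob:
                last_seen[agent] = t          # heartbeat arrived on time
            elif t - last_seen[agent] > AGENT_DOWN_TIME:
                # Server thinks the agent died: reschedule its routers,
                # an expensive, load-generating operation in its own right.
                reschedules += 1
                last_seen[agent] = t          # agent "comes back" later
    return reschedules

if __name__ == "__main__":
    for n in (30, 100, 300):
        print(f"{n:3d} agents -> {simulate(n)} reschedules in 10 minutes")
```

Running it shows essentially no churn at 30 agents and a steadily growing pile of reschedules as you approach one agent per compute node, which is the qualitative behavior the test bed hit.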
And so what ends up happening is not only do these L3 agents keep disappearing for no good reason and coming back, it actually puts excessive load on your system as they churn and get moved around.

So the question was whether our Switch Light VX runs in user space or kernel space, and whether it's a replacement for OVS. The answer is yes and no. We replace the user space part and keep the OVS kernel module. What we found, talking to a large swath of customers, is that many people are worried about putting weird stuff in their kernel, and I certainly understand that; they really want a kernel module that's been blessed by the kernel.org folks. So we decided to keep using the OVS kernel module, but we've replaced the user space piece with our own agent. Built into that user space agent is a bunch of the NAT functionality, the L3 routing functionality, the DHCP functionality, and a couple of other things. Does that make sense?

I apologize, sir, I can't hear you. Can you repeat the question? Rajneesh, feel free to jump in here as well. So what's interesting with our solution, because we have both the physical and the virtual, is that OpenStack's DVR is not DVR for the physical; it's only for the virtual. This way we can also integrate the physical bare-metal workloads, because we can implement the DVR-equivalent functionality in the physical switches as well. Anything you want to add to that? Okay, turns out I got it right. This is the downside of being CTO: you're supposed to know this stuff, but these are the guys who write all the code.

So these are the two problems we ran into as you start to scale out to 300 nodes, and as folks have guessed, it turns out our virtual switch actually solves both of them. First, we implement distributed virtual routing and NAT per compute node. That guarantees that between any two points you never hairpin; you always follow the shortest path. It also means that if you have a resiliency issue, say a rack goes away, only the nodes in that rack are affected; there's no cross-correlation problem. The other thing is that we've spent a lot of time figuring out how to do scaled-up monitoring of lots of switches and lots of nodes. Using our OpenFlow controller and the SDN primitives we've learned, we avoid all of the keepalive-storm type issues, and there's no need to start rescheduling all of these L3 agents all over the place. Does that make sense? That's kind of a mouthful, but I don't have a lot of time.

So yes, not pictured here, but we do implement service chaining. We don't implement it with tagging or tunneling; we do something that looks a lot like smart routing. You can think of it as: between two points, by policy, you implement a service chain by saying that as traffic crosses the logical router, we change the MAC address and the output direction so it goes to the service, but we don't change the packet header. And the magic is that after the traffic has been processed by the service, it comes back to the same logical router, and by the interface it comes in on, we can say, okay, that's already been processed by the service, now send it on to the destination. Does that make sense?
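Here's a minimal sketch of that service-chaining idea. This is my own illustration rather than Big Switch's implementation, and the MAC addresses, port names, and the policy lookup are all hypothetical: the router steers matching traffic to the service by rewriting only the next-hop MAC and egress port, and recognizes post-service traffic by the interface it arrives on, leaving the IP headers untouched.

```python
# Sketch of policy-based steering at a logical router (illustrative only).
from dataclasses import dataclass
from typing import Optional

SERVICE_MAC = "00:00:5e:00:53:01"   # hypothetical firewall appliance MAC
SERVICE_PORT = "svc0"               # router interface facing the service
DEST_MAC = "00:00:5e:00:53:02"      # hypothetical destination endpoint MAC
DEST_PORT = "leaf3/7"               # port toward the destination

@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    dst_mac: str
    in_port: str
    out_port: Optional[str] = None

def chain_policy_matches(pkt: Packet) -> bool:
    # Stand-in for a real policy lookup, e.g. "web tier -> db tier goes via firewall".
    return pkt.dst_ip.startswith("10.1.")

def route(pkt: Packet) -> Packet:
    if pkt.in_port == SERVICE_PORT:
        # Traffic returning from the service: forward it on to the real destination.
        pkt.dst_mac, pkt.out_port = DEST_MAC, DEST_PORT
    elif chain_policy_matches(pkt):
        # First pass through the router: detour to the service.
        # Only the next-hop MAC and egress port change; IP headers stay intact.
        pkt.dst_mac, pkt.out_port = SERVICE_MAC, SERVICE_PORT
    else:
        # No chain policy: normal forwarding.
        pkt.dst_mac, pkt.out_port = DEST_MAC, DEST_PORT
    return pkt

if __name__ == "__main__":
    first = route(Packet("10.0.0.5", "10.1.0.9", "router-mac", in_port="leaf1/2"))
    second = route(Packet("10.0.0.5", "10.1.0.9", "router-mac", in_port=SERVICE_PORT))
    print(first.out_port, "->", second.out_port)   # svc0 -> leaf3/7
```

The design point is that because the packet headers never change, the same logical router sees the traffic twice and can tell the pre-service and post-service passes apart purely by ingress interface.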
And much like the DVR functionality, what's cool about our solution, and I think unique about it, is that we do this not just for the virtual workloads but also for the physical ones. At least in my experience, there are a bunch of purely virtual workloads out there, but most workloads are partly virtual and partly physical. You get situations where people want to talk to a storage node, or a hardware firewall, or something like that. Whether that changes over time, and I think we're all hoping it does, Ironic is a real project, Magnum with bare-metal containers is a real project, and it will be interesting to see how this evolves, my claim is that most customers have some combination of physical and virtual workloads, and so this is a valuable thing.

The point I want to make here is actually a really interesting one. Doing scale in the steady state, where you're just trying to make sure things stay up, is actually the easy problem. What we've done here is a test we call the Chaos Monkey test, a little bit similar to what Netflix has done. We took the Hadoop TeraSort benchmark and ran it on a stable network, and it came out with some number: it ran in seven minutes, twenty-ish seconds. Then we ran that benchmark again while doing horrible things to the network: we killed the active controller every 30 seconds, a random switch every eight seconds, and a random link every four seconds, being careful not to cause a disconnection. Now, if this ever happened to your network it would be horrible. But what happened in this network, because we have multipathing everywhere and because we pre-program all the backup routes, is that there was effectively no discernible change in the performance numbers: it still ran in seven minutes, twenty-ish seconds, give or take. The lesson to take from this is that even if horrible things are going on in your network, as long as you have this style of SDN control, you can recover so quickly that your application barely notices. And we did this at scale. Again, it's easy to put up a 300-node test network that runs in a steady state, but when it has to keep recovering from this scale of link failures and switch failures, that's really where the rubber hits the road. Questions on this? There is a sketch of the failure-injection loop after this section if you want the details.

If anything I've said today is interesting to you, both of our products, the Big Cloud Fabric that I talked about today and the Big Monitoring Fabric that I only briefly mentioned, are available online with a free trial. If you just log into labs.bigswitch.com, you can find out more about them.

In conclusion, I feel like we did something that I don't know many other people have done: we really scaled out Nova to 300 nodes. And I should say that 300 nodes wasn't the limit of the software; that was the limit of the hardware we could throw at it, so presumably it scales out even larger than that. Scaling the steady state is relatively easy, but with the Chaos Monkey testing we did, I think we showed that it actually scales out in a real, production, bulletproof kind of way. There are a bunch of details I haven't had a chance to get into in this talk, so feel free to grab me or some of my colleagues, there are a couple of us around afterwards, or check out the white papers online. And with that, thank you very much, and I'm happy to take any further questions. Thank you.
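For reference, here is a rough sketch of the failure-injection loop described above. It's my own reconstruction, not the actual test harness, and the switch names, link model, and partition check are placeholder assumptions: while TeraSort runs, kill the active controller every 30 seconds, a random switch every 8 seconds, and a random link every 4 seconds, skipping any kill that would disconnect the fabric.

```python
# Illustrative chaos loop for a leaf/spine fabric (not the real test harness).
import random
import time

LEAVES = [f"leaf-{i}" for i in range(18)]
SPINES = [f"spine-{i}" for i in range(4)]
LINKS = [(leaf, spine) for leaf in LEAVES for spine in SPINES]

def would_partition(victim) -> bool:
    """Placeholder: a real harness would check fabric connectivity before killing."""
    return False

def kill(kind: str, victim) -> None:
    # In the real test this would power-cycle a node or shut a port; here we just log.
    print(f"t={time.monotonic():.0f}s  killing {kind}: {victim}")

def chaos_loop(duration_s: int) -> None:
    start = time.monotonic()
    tick = 0
    while time.monotonic() - start < duration_s:
        if tick % 30 == 0:
            kill("active controller", "bcf-controller-active")
        if tick % 8 == 0:
            switch = random.choice(LEAVES + SPINES)
            if not would_partition(switch):
                kill("switch", switch)
        if tick % 4 == 0:
            link = random.choice(LINKS)
            if not would_partition(link):
                kill("link", link)
        time.sleep(1)
        tick += 1

if __name__ == "__main__":
    chaos_loop(duration_s=12)   # short demo run; the benchmark ran for ~7.5 minutes
```

The claim in the talk is that with multipathing everywhere and backup routes pre-programmed, the TeraSort runtime was essentially unchanged under this level of injected failure.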
Sorry, there's a lot of background noise. Do you know the answer, Rajneesh? I don't. So DVR, as far as I know, is actually a relatively new thing, and this test was with Juno. For those who missed that, the other point is that the existing Kilo DVR implementation, I think, doesn't do high-availability failover of the floating IP state. So if you lose a DVR instance, you actually lose the state for the NAT translations. Is that correct? Yeah. Okay. All right. Thank you, everyone.