Hello, everyone. Thanks for coming to our talk today. I'm Nolan Leake, one of the co-founders of Cumulus Networks. Unlike my friends here, not everyone may know what Cumulus Networks does, so I'll give you a brief overview. We have a Linux distribution called Cumulus Linux that is unusual in that it runs on switches, specifically top-of-rack and spine switches, and it essentially hardware-accelerates the networking functionality of the existing Linux kernel.

I'm Brad Watkins. I work for Red Hat in the NFV Partner Engineering group, and I worked with Nolan on this installation.

Hi, I'm Jevent Wirk. I'm part of Dell Networking, and I'm responsible for software-defined networking and solutions around OpenStack.

OK, so why did we do this? Here's the overall agenda: we're going to talk about what we were trying to prove, how we went about planning it, how we used virtual machines to get ahead of the game before we actually had access to the lab, and what we learned from the process.

We wanted some questions answered. We wanted to prove that we could, very quickly and in an automated fashion, roll out an entire OpenStack cluster covering the networking, the control nodes, and the compute nodes. And we had limited time: Dell graciously gave us time in their lab, but it was a limited amount, so we wanted to be sure everything would work before we got in there. So we decided to prototype everything in virtual machines first, both the switches and the servers. We also wanted to try a somewhat uncommon topology: using VXLAN to provide the L2 connectivity to the VMs and, less commonly, running the layer 3 boundary all the way down to the vSwitch. To do this, we used Ansible and Git, the idea being that we could have a single, unified set of tooling for both the servers and the switches. And we wanted to see if we could do this all remotely: I'm located in Mountain View; Brad, I think, is outside of Detroit; and our Dell colleague is in Santa Clara, California, so only one person was local. To spoil the surprise, the answer is yes.

So our first step was that we needed a lab. This is the lab that Dell graciously let us use. This is actually the only time I was in the lab for the entire project, and someone else was using it at that point, so I wasn't really allowed to touch anything; I was kind of breaking the rules a bit there.

So you saw that picture over there. Essentially, what this lab is for is building proofs of concept: before committing, customers want to see performance, benchmarking, or other features demonstrated in an end-to-end solution, with a scalable rack of servers, a scalable number of switches, storage, and everything included. The lab holds about nine racks' worth of equipment, which is pretty significant; many real data centers are probably half that size. So it's a fairly large emulation of a data center. And one of the big things about it is that we try to keep the setup remotely accessible. Except for maybe one time when somebody stepped in there, I think every single part of the deployment of OpenStack and Cumulus Linux was done remotely over VPN.
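To make the idea of one set of tooling for both switches and servers concrete, here is a rough sketch of what a unified Ansible inventory for a setup like this might look like. It is only an illustration: the file name, host names, and group names are invented, not taken from the actual lab.

    # hosts.yml (hypothetical): one inventory spanning switches and servers,
    # so the same playbooks and templates can target either side.
    all:
      children:
        switches:
          children:
            spines:
              hosts:
                spine01:
                spine02:
            leafs:
              hosts:
                leaf01:
                leaf02:
                leaf03:
                leaf04:
        servers:
          children:
            controllers:
              hosts:
                controller01:
            computes:
              hosts:
                compute01:
                compute02:

The point is simply that the switches are just more Linux hosts in the same inventory, so the same playbooks can be aimed at either group.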
Essentially, these are Dell PowerEdge R220 servers with 16 GB of RAM and quad-core processors, with 10 Gb adapters for dual-homing into the top-of-rack switches. So we try to build a fairly large, end-to-end leaf-and-spine architecture using this particular lab, and what you get out of it is a significant amount of bandwidth for east-west traffic. Overall, this is about six spine switches and 18 leaf switches, which is good for roughly 15 terabits of east-west traffic inside the network. These switches are essentially Trident II based; they are capable of doing VXLAN hardware offload at line rate, so they can provide the VXLAN layer 2 gateway functionality as well.

So the first step was to design the network. One slightly unusual thing we decided to do was to run routing all the way down to the hypervisors themselves. To achieve that, we deployed the Cumulus version of Quagga, which, since it runs on Cumulus Linux, is happy to run on any Linux. So we deployed it on Red Hat Enterprise Linux on the servers and had it peer with the top-of-rack switches. If you look at those two links going up to the top-of-rack switches, those are not a bond; that is not an MLAG. You'll see there's no inter-switch link between the two top-of-rack switches, because it's not needed. What's happening is that the servers themselves are announcing their IP address, the underlay IP address that receives the VXLAN-encapsulated traffic, up into the network, and then the switches distribute it around. This is interesting because we don't have to have MLAG configured, with all of the complexity, brittleness, and proprietary-ness of the various MLAG protocols. We do support MLAG; we just didn't use it in this one.

And since both the servers and the switches are running Linux, we were able to use a common set of tooling, in this case Ansible and Git. That way, for example, the template that configures BGP on the servers is the same template that configures BGP on the leafs, which is the same template that configures BGP on the spines, because we used a configuration called BGP unnumbered, where the only thing that changes in the configuration of each router is its own local IP address; there are no IP addresses on the individual links. And of course, since these were Dell Open Networking switches, we were able to use ONIE to automatically install Cumulus Linux on the switches.

So once we had a topology, given the time constraints we had, we obviously wanted to prove it out in a virtual environment. One nice thing about Cumulus Linux is that Cumulus VX is available for download for anybody who might want to try it, and for us to prototype with. So essentially what we did was this same design, but in virtual machines; topologically, it's exactly the same. We had two spines and four leaf switches, again all running Cumulus VX, and then a number of virtual machines running Red Hat OpenStack Platform. That let us prototype the Ansible playbooks we were going to end up using in the physical lab once we had access to it, and build up the inventories, that kind of thing. Once we had all that going, we moved on to the physical lab. Again, we used exactly the same Ansible playbooks we wrote while prototyping, and we were able to use them to deploy the overcloud.
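As a rough illustration of the shared BGP template idea, here is a minimal Ansible sketch. It is not the actual playbook from this project: the group names, variable names (bgp_asn, loopback_ip, fabric_interfaces), and file paths are assumptions, and it presumes the Cumulus build of Quagga, which understands interface-based unnumbered BGP peers, is already installed on every node.

    # bgp.yml (hypothetical sketch): the same task and embedded template run on
    # spines, leafs, and servers; only per-host variables (ASN, loopback address,
    # list of fabric-facing interfaces) differ between roles.
    - hosts: "spines:leafs:servers"
      become: yes
      tasks:
        - name: Write BGP unnumbered config (no per-link IP addresses)
          copy:
            dest: /etc/quagga/Quagga.conf
            content: |
              router bgp {{ bgp_asn }}
               bgp router-id {{ loopback_ip }}
              {% for iface in fabric_interfaces %}
               neighbor {{ iface }} interface remote-as external
              {% endfor %}
               address-family ipv4 unicast
                network {{ loopback_ip }}/32
          notify: restart quagga
      handlers:
        - name: restart quagga
          service:
            name: quagga
            state: restarted

Because peers are named by interface rather than by address, the rendered file for a spine, a leaf, or a hypervisor differs only in its ASN, router ID, and interface list, which is what allows one template to cover all three roles.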
We did those in batches, primarily due to some hardware constraints that we had; it was just easier to do it that way rather than 300 all at once. If we had slightly more substantial systems, we probably could have done it in one step. But once we had that deployed, we were able to create 1,000 tenant networks across the compute nodes and then do some testing on them.

Yeah, and to emphasize just how easy it was to move from the virtual prototype to the actual deployment: we literally just checked out the same Git tree with the Ansible scripts in it and ran it, and I was pleasantly surprised that it worked on the first try.

So what did we end up with? We had the common tooling. In this case we only took advantage of it for deployment and management, but we could have also installed common monitoring. If we wanted to use collectd or something like that to monitor the system, we could run the exact same collectd on the servers and the switches and get a unified view, a single pane of glass showing the entire system in an integrated fashion.

On the networking side: the small, stripped-down version we tested in VMs probably took five minutes to deploy; the full version, with all 24 switches, took about 15. ONIE itself, which is essentially the switch equivalent of PXE, except that it supports things like HTTP and HTTPS, so it's a little more modern than TFTP, took a little under 10 minutes to install Cumulus Linux on all 24 switches. Once that completed, the switches reached out using a technology we call ZTP, zero-touch provisioning: they go back to the server where they got their Cumulus Linux image and ask for a script, then download that script and run it. In this particular case, that script was extremely simple: it just installed the SSH key so that Ansible could connect over SSH and configure the switches the rest of the way. The actual Ansible configuration then took about five minutes, maybe a little more. All of this configuration ended up being maybe 50 lines of Ansible, including the templates. So it was extremely straightforward and, as I said, very fast.

Sure, so the undercloud took about six hours. With that number of systems, it just takes a while for them to work through the process: because OpenStack Platform Director is based on TripleO, you have to go through the introspection process and then through to the deployment. So it took a bit of time, but it was successful. Once we had the overcloud built, we were able to do some stress testing with Rally and some analysis with a project called Browbeat, which is written by one of my coworkers, Joe Talerico from Red Hat. I should actually take a quick moment to shout out to Joe, who was also instrumental in this project, as well as Leif Madsen, who was on our team working on it.

So, unfortunately, unlike many of the presentations I've given here, I don't have a giant stack of switches for a live demo, and we don't have access to the lab anymore, so we'll have to settle for a screenshot. There's a Firefox extension I found that lets you take a screenshot of the entire webpage, even the parts that have scrolled off the bottom. So while the system was up, I pulled up the Horizon hypervisors list so we could see all the various hypervisors. You can see that they have eight vCPUs, 16 GB of RAM, and some local hard drive storage.
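To give a sense of how small that ZTP step can be, here is a hedged sketch of the sort of script described above, staged by Ansible on the same web server that serves the Cumulus Linux image. The host group, paths, placeholder key, and the choice to authorize the default cumulus user are assumptions for illustration, not the script actually used in the lab.

    # ztp.yml (hypothetical): publish a minimal ZTP script that only installs
    # the Ansible control machine's SSH key, leaving all real configuration to
    # the Ansible run that follows.
    - hosts: ztp_webserver
      become: yes
      tasks:
        - name: Publish the ZTP script the switches fetch after ONIE installs Cumulus Linux
          copy:
            dest: /var/www/html/ztp.sh
            mode: "0755"
            content: |
              #!/bin/bash
              # CUMULUS-AUTOPROVISIONING  (marker that Cumulus Linux ZTP looks for)
              # Authorize the Ansible control machine's SSH key for the cumulus user,
              # then let the regular Ansible run do everything else.
              mkdir -p /home/cumulus/.ssh
              echo "ssh-rsa AAAA...example-key ansible@laptop" >> /home/cumulus/.ssh/authorized_keys
              chown -R cumulus:cumulus /home/cumulus/.ssh
              chmod 700 /home/cumulus/.ssh
              chmod 600 /home/cumulus/.ssh/authorized_keys
              exit 0

Everything beyond that one authorized key is left to the normal Ansible run that configures the switches the rest of the way.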
We weren't running anything at that point because this was just when it first came up, and there are a few of these pages. And there you go: you can see all 298 hypervisors, and we did eventually spin up quite a few VMs on this, and about 1,000 virtual networks. So we were able to prove out what we were attempting to prove, which is that we could quickly, using automation, install and configure all of these nodes, and that the configuration would actually work at this scale. That was good and interesting, because I think the previous largest OpenStack deployment with this VXLAN configuration that we'd done was a little over 200 nodes, so this is a nice step up from that.

So hopefully this sounded interesting and you're chomping at the bit to try it yourself. The good news is you can do that completely for free, as long as your laptop has eight or 16 gigs of RAM; if it doesn't, you'll have to buy some more RAM. You can download what we call the "rack in a laptop." Part two is the one that uses VXLAN, so that's probably the one you'd be interested in here; we also have a part one that's a very simple, traditional VLAN-based setup. What it's going to do is spin up a handful of VMs. We have a free download called Cumulus VX, a version of Cumulus Linux that runs in a VM; it runs on VMware, VirtualBox, or KVM. In this case, it'll spin up three of those, one representing a spine and two representing top-of-rack switches. It'll also spin up two CentOS VMs, one to run the OpenStack controller and one to be the compute node. So this is the minimal configuration that shows all the moving parts. There's also an external router that's pretending to be whatever your upstream gateway to the internet or the rest of the data center is. This will completely automatically configure all of it, and then you can poke around, and all of the code is available. So if you want to see the Ansible scripts that are configuring the various VMs, those are on GitHub, and you can take it apart, play with it, spin up nested VMs, and experiment.

So if you'd like to learn more: we have a press release, but since you've already heard this talk, that's probably not very interesting now. MovingPackets.net did a review; they seemed to like it, so that was good. As I mentioned, Cumulus VX is on our website and completely free to download. It also has Vagrant support, so you don't even need to go to our website; you can just have Vagrant pull the image down from its repository. And then all of the playbooks, and the RPM package of our Quagga that the Red Hat folks made, are available on GitHub. These slides will probably be available somewhere, so you don't have to frantically write all that down. And if you want to take the next step, we have a community where you can come ask questions, and people can help you if you're having problems with VX or the virtual demo. We also have a lot of other labs and demo modules that you can play with on VX to learn interesting pieces. This is just one, to learn OpenStack with VXLAN; we have OpenStack with VLANs, configuring OSPF, BGP, MLAG, just all the various networking tasks that are common. And that's it. Thank you, everyone. We have time for questions.

So which part did you spend the most time troubleshooting? Introspection. I will say that we used OSP 7 for this, and introspection in the Kilo release is not nearly as good as it is in Liberty.
And thus OpenStack Platform 8 is substantially better in that regard. Yeah, I got off easy: I only had 24 switches to deal with; they had 302 servers to deal with. Thank you.

So one of the things I'd like to highlight is what we really got out of this particular proof of concept. Essentially, with a common tool chain, the same tool chain you would use to provision your servers is now able to provision the network. Normally it might have taken six hours for the networking and maybe three hours for the servers, but in this case, because of the common tool chain, and because Cumulus Linux is really optimized for networking, it was so simple that a common Ansible script could go ahead and push all of those network-related changes, and it took less than 10 minutes to set up the network.

One more thing I would like to say here is that having a common tool chain essentially cannot happen unless your networking follows an open networking model. These switches run an ONIE boot loader, which essentially works like PXE boot on a server: the same way you'd PXE boot a server, you're able to install and run a Linux operating system on the switch itself. From that point onwards, the power of Linux takes over, and you're able to do your DevOps, rapidly change your deployments, and have change management happen at a very accelerated pace.

Any questions? I guess we've got some time. So one other interesting thing I forgot to go over: while we were doing the actual live configuration, as I mentioned, none of us was physically present, but what I didn't mention is that we were provided a VPN in. For the Ansible configuration, at least in my case (I don't know exactly how the Red Hat folks did it), I was able to run Ansible locally on my laptop and have it reach out over the VPN and SSH into all of the switches to configure them. This is very similar to a common production deployment scenario, except that there you usually wouldn't be over a VPN; often you're actually in the data center with your laptop plugged directly into the management network, and then you can run your Ansible scripts directly. So it's very, very convenient.

All right, I think we're probably giving you back about 15 minutes? Yeah. Thank you.
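As a final illustration of that remote workflow, here is a small sketch of the kind of connection settings one might keep in group_vars so that Ansible, run locally on a laptop attached to the VPN, can SSH straight into the switches. The group name, user, and key path are assumptions, not details from the actual lab.

    # group_vars/switches.yml (hypothetical): connection settings for reaching
    # the Cumulus Linux switches from a laptop over the lab VPN.
    ansible_user: cumulus                          # default admin user on Cumulus Linux
    ansible_ssh_private_key_file: ~/.ssh/lab_key   # private half of the key installed by ZTP
    ansible_become: yes                            # configuration changes need root

    # With this in place, the playbooks run unchanged from the laptop, for example:
    #   ansible-playbook -i hosts.yml bgp.yml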