So welcome to our talk at five about Cumulus Linux for network switching. Let me introduce Nolan Lake. Please give him a warm applause.

All right. Thank you. I'm here to talk about Cumulus Linux, which is a Linux distribution for network switches. Cumulus Linux is very heavily Debian-derived. In fact, many of the packages in Cumulus Linux come straight from the Debian repo. Some have patches added by us on top of the Debian patches, which we then rebuild, and of course some are entirely new.

If you're familiar with the more common, proprietary network switch operating systems like Cisco's NX-OS or Juniper's Junos, you probably expect a network switch to have some weird CLI that you have to learn, with a bunch of commands and a limited functionality set exposed by those commands. Cumulus Linux is not like that at all. We have no proprietary CLI; the CLI is bash. If you SSH into the switch, you just get a bash prompt. All of the front panel ports show up as ordinary Ethernet devices, just like eth0 and eth1. All of the commands you have come to know and love for configuring networking on Linux just work out of the box. You can use any program that works on Linux on a Cumulus Linux switch. For example, for DHCP we just use ISC DHCP. For routing protocols, instead of writing our own OSPF and BGP implementations, we just use Quagga. But if you prefer BIRD, you can just install it and use it.

But let me back up a bit and talk about what a switch is. We have a set of hardware partners that manufacture switches, often the same ones that you would buy with a proprietary OS on them, or with a fully proprietary OS; we'll get into that. Internally, these things are just a little computer. Some of them have PowerPC chips, some have Intel chips, some even have ARM now. They have RAM, they have storage. So it ends up being relatively straightforward to run Linux on them. The only thing that's unusual about them is that they have a really big ASIC right in the middle, PCI Express attached to the CPU, and that is the piece that connects to all those front panel ports. This one, for example, has 48 10-gig ports and four 40-gig ports, which is a lot of networking capacity to plug into a server. That's why it has that special ASIC: it handles all of the networking functionality in hardware.

The software architecture generally breaks down to exactly what you'd expect, with one caveat. There's a tap device created for each one of these front panel ports that is then, behind the scenes, connected to the actual front panel port, for handling traffic to and from the CPU itself. But all of the other data structures, instead of being a big blob of proprietary software like in most network operating systems, are just the kernel's data structures. For example, the routing table is just the kernel routing table. Bridge configuration is just whatever you threw together with brctl. Similarly, the ACL table is just iptables and ebtables. So it's all stuff that probably everyone in this room knows how to configure, and there's already a huge amount of software that knows how to drive it.
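To make the "it's just Linux" point concrete, here is a minimal sketch of the kind of standard commands you could run straight from the bash prompt on such a switch. The interface names and addresses are made-up examples, not taken from the talk.

```bash
# All standard Linux tooling; interface names and addresses are made up.

ip link show                                   # front panel ports are ordinary netdevs
ip addr add 192.0.2.1/24 dev swp1              # address a port

# The kernel routing table is the routing table.
ip route add 198.51.100.0/24 via 192.0.2.254

# Bridging is plain Linux bridging.
brctl addbr br0
brctl addif br0 swp2
brctl addif br0 swp3

# ACLs are just iptables/ebtables rules.
iptables -A FORWARD -s 203.0.113.0/24 -p tcp --dport 22 -j DROP
```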
So we have some prerequisites. Obviously, you need some way to install this on a switch. Normally when you buy a switch, the OS is already baked in there; it came from the factory as one appliance. To break these things open and separate the hardware from the software, we needed some way to install, something on the switch when it came out of the factory. That thing is called ONIE, which we've contributed to the Open Compute Project. ONIE is a small, very minimal distribution, unlike Cumulus Linux, which is based on Debian and has a huge set of functionality. ONIE is a stripped-down little BusyBox thing, and its sole goal in life is to boot up and play the role that PXE does on servers, only a lot better. PXE kind of sucks; who likes using TFTP? It doesn't support IPv6. It's not great. With a base of Linux under it, it was very easy to do things like support IPv6, and to let you install over SFTP instead of TFTP, or over HTTPS, or whatever is convenient in your environment. And this is only used at install time. Once the switch has booted into ONIE and found an image to install, hopefully Cumulus Linux, but there are others available, ONIE is no longer involved; the switch then boots directly into the real operating system.

I also wanted to talk a bit about our contributions. We're one of the largest contributors to Quagga today, because that's the routing protocol suite we opted to use. We did a large amount of work on the OSPF implementation, in particular OSPFv3, which is what you use for IPv6, and we've also done a large amount of work on the BGP implementation. This is all stuff we've upstreamed back to Quagga; we don't keep any patches proprietary. And in the kernel, we've been contributing heavily there too. We were bad about that for a little while, but we've gotten our patch backlog mostly upstreamed; there's still a little bit left to go.

And then, relevant to Debian, we've actually rewritten ifupdown. As I'm sure most of you know, that's the tool that manages /etc/network/interfaces. If you've ever looked at it, it's written in this weird literate-programming language thing, I believe it's CWEB or noweb, I can't remember which one offhand, so it was exceedingly hard to modify and work on. We rewrote it in Python, and we added things like template support, using Mako for that. There were a couple of things that motivated us to do this. The biggest one was scalability: when you have 48 or 52 front panel ports with a large number of VLAN sub-interfaces hanging off of them, put into bridges, you can end up with thousands of interfaces, and ifupdown was not scaling particularly well to that. The other big thing we fixed was that with ifupdown, the order of the interfaces in the file is extremely important; if the VLAN sub-interfaces and the bridge that includes them are listed in the wrong order, ifup won't actually bring them all up. What ifupdown2 does is sort all the interfaces in dependency order and bring them up in an order such that each step succeeds, so when you're done you have the full configuration up. One other thing we added was reload support: the ability to edit /etc/network/interfaces and do a networking reload that compares the current state in the kernel with the new config file and figures out the minimal, non-disruptive changes required to bring the kernel up to date. We're currently working with some folks to try to get this in as an actual package in Debian. We're not there yet, but we're definitely working on that.
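As a rough illustration of what that looks like in practice, here is a minimal sketch of an /etc/network/interfaces fragment with VLAN sub-interfaces in a bridge, applied with ifupdown2's reload. The interface names, VLAN ID, addresses, and exact stanza attributes are illustrative assumptions, not something quoted in the talk.

```bash
# Sketch only: names, VLAN ID, and attribute spellings are assumptions;
# check the ifupdown2 documentation for the exact stanza syntax.
cat >> /etc/network/interfaces <<'EOF'
auto swp1.100
iface swp1.100 inet manual

auto swp2.100
iface swp2.100 inet manual

# ifupdown2 sorts stanzas into dependency order, so the bridge can be
# declared before or after the sub-interfaces it includes.
auto br100
iface br100 inet static
    address 10.1.100.1/24
    bridge-ports swp1.100 swp2.100
EOF

# Apply only the minimal, non-disruptive changes needed to match the file.
ifreload -a
```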
Now we'll get down to the lowlight, so I hope nobody brought fruit to throw. We do have one proprietary piece. If we go back to the diagram, it's the red part in the corner. Actually, this diagram is slightly wrong: the part in the kernel is GPL. We worked with the hardware vendor to get them to open that up, so that we didn't have to ship a proprietary kernel module, because that would have been bad. But the part up there, switchd, is actually proprietary, because it talks directly to the ASIC, in this case a Broadcom ASIC. Who here has dealt with Broadcom in the past? Okay, so you know what's going on. We got the information needed to program this under an NDA, and it links against their software development kit, which is obviously proprietary.

The good news is that Cumulus Linux, the entirely open source part, is still completely functional and actually very usable without this piece. All you lose is the ability to run it on an actual switch. You can still run it in a VM, as a router VM or a switch VM, and you can still run it on an x86 server if you wanted to build a router out of that. All of that other functionality, ifupdown2 and all of the Quagga routing protocol enhancements, is still baked in. In fact, you can download a VM image from our website if you'd like to try it out.

We are trying to get towards a better future here, though. We were involved in the initiation of a project to bring switchdev into the kernel. This is similar in spirit to netdev: netdev is a way to describe NICs, so you can have different drivers for your different NICs and still just get your eth0. switchdev is the same thing, but for these switching ASICs. The big problem we're running into, of course, is that these vendors are all extremely paranoid about their programming specs. I make the argument fairly frequently that, hey, the more people who know how to program your chips, the more people write software for your chips, and then the more people buy your chips. But the hardware industry is a little paranoid. So we've seen a couple of responses. Some vendors just say absolutely not, we don't want to have anything to do with this. Others say, oh, no problem, we're gonna open source our driver, and then you get it and it's a bunch of one-line stubs that RPC across to an ARM core on the chip running the giant proprietary blob, which doesn't seem like it's really any better. So this is gonna be a slog, it's gonna be a long road. I'm optimistic that we can get to the end of this road in a satisfactory manner, but it's definitely not gonna be easy. So that's all of the material I had. Any questions?

In terms of the drivers for the various ASICs and SoCs, I hope you've heard of OpenDataPlane?

OpenDataPlane, yeah. Let me elaborate a bit there. We're entirely focused on actual fixed-function forwarding ASICs. OpenDataPlane is more about SoCs, like Cavium's, where most of the forwarding is done in software but there are offloads for certain fast paths. So this is very different: here, all of the forwarding is done in hardware. Sure. There's no fast path. I should mention, I work for Linaro in the networking group, so I know the people who developed OpenDataPlane. Oh, maybe we should talk afterwards. Absolutely, yes.
So you said this is just a normal bash, and I can use all the usual tools, but the advantage of IOS, IOS XR, what have you, is that I'm protected, at least in some circumstances, against configuring things which then get executed on the CPU and not in the silicon anymore. So did you build any safeguards which tell me, okay, you are about to leave what the silicon can do, and you will end up in CPU land and just break all performance?

Yes, absolutely. We actually never fall back to the CPU, because the performance disparity is enormous: the highest-end parts can do two and a half terabits per second of switching, and the CPUs can't do a hundredth of that. So we'll never silently fall back to software. In almost all scenarios, all the normal scenarios where you would end up not being able to do something in hardware, we'll warn you and roll back the change. There are a couple of minor edge cases where that doesn't happen today; we consider those bugs to be fixed.

If I'm not mistaken, Cumulus is a Debian derivative, right? How different is it from Debian?

We try to keep it as close as possible. We actually use Debian binary packages for probably 90% of the packages in the system. Things like Quagga we've worked on extensively, so we have our own version that we manage, same with the kernel, and we've added some software. We switched from ifupdown to ifupdown2. But for the most part, it's Debian. Right now we're wheezy-based; we're in the middle of porting forward to jessie. It'll be in the next release, but it's gonna take a bit of time.

Is it just the 48-port, single-chassis form factor that you're targeting?

Right now we don't have any philosophical ties to any particular hardware form factor. Right now it's these 48 by 10-gig switches with either four or six 40-gig uplinks. We also have 32 by 40-gig, and 32 by 100-gig is coming soon. We also have one-gig switches that are 48 by one-gig plus four 10-gig uplinks.

But still just one chassis?

Yeah, we're not philosophical about that. What we are philosophical about is that we don't do stacking. The problem with stacking and these kinds of chassis-of-chassis protocols is that they're incredibly proprietary and incredibly brittle. They're great when they work, but when they don't, you're at the vendor's mercy. What our customers tend to do instead is use tools like Ansible or Puppet to orchestrate all the switches, and they tend to do more L3 instead of L2, so they end up using open protocols like iBGP or OSPF to stitch all of this together. That way, when something goes wrong, you can go dump all the OSPF state and say, oh yeah, the adjacency didn't form on this link right here, and that's why this other path is getting overloaded, let's figure out why. Or the MAC is missing on this machine for some reason, let's go figure out what happened. (That's roughly the kind of poking around sketched below.)

What do you think about the roadmaps that other vendors are following, like putting compute and switching onto the same box, putting VMs onto the same box?

I mean, I view them as following us, right? We didn't have to do anything to let you put a VM on our box. You just apt-get install KVM and off you go.

Okay, but that's already something that you intend to do with this box.

Oh, we've done that from the beginning. I mean, literally the first thing we did was bring up Linux on this thing, and Debian specifically. At that point we hadn't even implemented the hardware forwarding, so the only thing you could do on it was run VMs and things like that.
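Going back to the troubleshooting point for a moment, here is a rough sketch of that kind of poking around, assuming the Quagga daemons and their vtysh shell are running; which show commands are available depends on the daemons you actually enable.

```bash
# Inspect routing protocol state via Quagga's vtysh (a sketch; the
# available commands depend on which daemons are enabled).
vtysh -c 'show ip ospf neighbor'    # did the adjacency form on that link?
vtysh -c 'show ip bgp summary'      # are the BGP sessions established?

# The kernel tables are the source of truth, so plain iproute2 shows
# the same picture the hardware is forwarding from.
ip route show
bridge fdb show                     # is the MAC you expect actually learned?
```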
Do you by default separate out the management? Is the switch management in, say, dom0, or is it a user VM?

So we don't support Xen, so you'd be using KVM in this case. And in the default use case, if you're just using the box as a switch, you wouldn't have any VMs at all; there's just the host kernel, and its routing table is what the hardware is forwarding. We do have management separation, so you can have separate routing tables for the eth0 management port versus the actual data plane ports. But that's just using ip rule and the multiple-routing-table support that's already in the kernel. (There's a sketch of that after this exchange.)

What about fancy stuff like NetFlow and EVPN?

We support sFlow, not NetFlow; I think NetFlow is a Cisco proprietary one, so we'd be less likely to support it. EVPN is something we're working on in the context of carrying VXLAN VNI tags around, but we don't have it in a shipping version yet. It's something that would be totally reasonable to do, and unlike with most vendors, if you decided you wanted it and you had an Emacs open and a C compiler, you could hack our Quagga to do it. All right, thanks.

Are you committed to VXLAN, or are you also planning to support MPLS fully?

We are working on MPLS. I don't know how fully we'll support it; I don't know that we're going to try to replace the big telephone-company carrier routers' MPLS, but for the uses of MPLS inside the data center and between adjacent data centers, that's definitely stuff we consider and are working on.

When you look at long-term stability, take a Catalyst 6500: they may be old, but they run for years and they do what they need to do. How, as of today, do you view Cumulus Linux? Can I just deploy it now and have it run for five years, or do you have a realistic expectation of when you'll reach that point? As soon as I secure the machine, can I just keep it running forever?

Yeah, we've only been around for five and a half years, so nothing has been running for five years yet, but we do have a large customer that's been running approximately 30,000 switches on Cumulus Linux for two-plus years now with no major issues.

Yeah, but anyone at that scale would probably upgrade quite aggressively, because they can, but sometimes you have machines in places where you cannot really do that. Is this anything you're looking at, or are you basically saying people just need to upgrade regularly and that's that?

Well, we would generally advise people to upgrade, but we do provide security patches for older versions, so we're not gonna leave people high and dry. But any new functionality, or bug fixes that aren't security relevant, the older releases just aren't gonna get.

You say you support Quagga, but you're also able to run BIRD. I prefer Quagga, so that's good, but what are you doing about the performance of Quagga, especially when it comes to BGP with large routing tables? Are you looking at multi-core? Because BIRD is really, really quick and Quagga is really, really slow.

Yeah, we've been doing a lot of work on Quagga performance and scalability. We've improved it, probably not as much as it needs to be improved, so we're still working on it. We also did things like switching from select to epoll for handling a high number of sessions. These are all things that we test first and then fix when we find problems.
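Picking up the earlier point about management separation, here is a minimal sketch of doing it with only stock kernel features, an extra routing table plus ip rules. The table name and number, interface names, and addresses are assumptions for illustration, not Cumulus's actual configuration.

```bash
# Sketch: give the eth0 management port its own routing table.
# Table number/name and all addresses are made-up examples.
echo "100 mgmt" >> /etc/iproute2/rt_tables

# Management routes live in the "mgmt" table...
ip route add default via 10.0.0.1 dev eth0 table mgmt

# ...and ip rules steer traffic to and from eth0's address through it,
# keeping it apart from the data plane ports' main table.
ip rule add from 10.0.0.20/32 lookup mgmt
ip rule add to 10.0.0.0/24 lookup mgmt

# Data plane routes stay in the main table as usual.
ip route add 198.51.100.0/24 via 192.0.2.254
```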
Thank you. So 30,000 switches, that's a production use case, is it? How about 802.1X on the ports, PoE on the ports, and during the upgrades that you mentioned, does the switching continue at the pure ASIC level, or is the entire switch rebooting?

We don't do ISSU, and that's another one of these philosophical points. I would say you should build a denser, interconnected network, such that you can take down a single switch, upgrade it, bring it back up into its adjacencies, and no one would even notice; then you can do a rolling upgrade. The problem with ISSU is that it's a really complicated thing: you're persisting all of the live runtime state of the software somewhere, rebooting, and then reloading that old state into a totally new version of the code and trying to pick up where it left off, including things like not breaking connections to BGP peers, so remembering what sequence number you're at and where the BGP state machine was. That ends up being pretty brutal.

But the point is, if you don't support stacking, you don't have the option of dual-homing a downlink server. Do you support vPC or something like that?

We support MLAG, so you can have a server with two links bonded together to a pair of top-of-rack switches, and then the spine should be redundant using routing protocols. If you architect the network with redundancy at every level, you can do a rolling upgrade across it with zero downtime.

So which of your switches would be spine and leaf? All of them?

Yeah, yeah. Okay. Was there one more? Yeah, we're running out of time.

You said you support MLAG. How do you do that? Is that part of the proprietary switchd?

There is one tiny little piece that's currently in there. We're working on pulling that out, and we'll probably have that pulled out in the next month or so. The rest of it is just a Python program that is a complicated state machine. MLAG is a very complicated thing; I didn't wanna do it, but people really, really like it, so we did it, and it's all essentially just a Python program. It manipulates the bridging state and the STP blocking state of various things, and we had to modify the mstpd we use to be MLAG aware, so they coordinate with each other. There are also some ebtables rules that get added at various times. But it's all the same things you would do in a proprietary MLAG; we're just reusing the kernel constructs, because hey, they're already there, so let's just use them. We had to modify the kernel a little bit as well, and we're trying to get that upstream too. Most of it's already upstreamed. Excellent, very good. John was just filling in that I actually misspoke: most of our patches have already been accepted upstream, so there's only a handful left for MLAG support that aren't upstreamed yet; we're still working on those. Thank you very much. All right, thank you.