Our next talk is going to be about open networking and a solution that Attila and his team are building for having open hardware and open software in data centers, and how that will look in the future. Give a warm welcome to Attila De Groot and enjoy this talk.

Thank you. A few people might know me from organizing here at SHA, but during the day I also have a day job. I work for Cumulus Networks, and today I'm talking a bit about open networking, how that evolves around Cumulus, and about hardware, software, and data center network design.

Cumulus was founded about seven years ago, so I don't really want to call it a start-up anymore. One of the starting points was that if you look at typical data center networking, the large providers don't really use typical hardware anymore. A classical enterprise uses Cisco for networking, VMware, and EMC for storage; that's very classical. But if you then look at a Google or a Facebook, they typically do it differently. For example, both Google and Facebook wrote their own operating system that runs on top of their switches, works better with the applications they are running, and integrates better with their processes.

Now, the thing is that most companies, if you're not a Google, don't have the capacity or the resources to actually develop an operating system for their switches. That is the idea behind what Cumulus does. Instead of buying a box that comes with an operating system and is basically quite closed, you buy a box, so you have the freedom to choose your own hardware, including cables, optics, and so on, and on top of that you get a network operating system. Of course, Cumulus prefers it if you use Cumulus Linux, but there are a few other operating systems around. The idea is that the flexibility of choosing your own operating system also makes it more flexible to integrate with the applications that you are using.

First, let us have a bit of a look at the hardware that is available. In my position I'm an SE, which means that I report to the sales department, but the idea is that I help customers choose their network hardware, help design it, and so on. One question that I regularly get is: okay, do I buy some fake Chinese hardware, and how does that actually work? If you look at data center switches, there is basically not much difference between a brand switch and a white-box switch. Most data center switches run on Broadcom silicon these days, about 95%, and there are only six ODMs in the world that actually make the hardware. For example, Edgecore is a subsidiary of the Accton group, and that same ODM also makes hardware for the name-brand vendors. That means you get exactly the same hardware; it just has a different color or doesn't have any labeling. But it all runs on the same chipset, so it can do exactly the same things.

One thing is that, until the beginning of this year, most white boxes didn't come in a chassis model. If you're building large-scale networks, it might be preferable to actually have a chassis; I'll get into the design part later. The problem with classical chassis is that they have proprietary blades and network processors, and their architecture simply isn't open in a way that allows other operating systems to be installed on them. Facebook changed that: they released the design of their Backpack chassis, and these are some photos that I took in our lab when I was in the US.
And what you're seeing is a 128-port, 100-gig chassis, and basically it is a box with multiple switches in it. On the right side you see one of the line cards. It has two ASICs, two Broadcom ASICs. You have front-facing ports, and on the back you also have a connector. On the right side you see the same blade again, and you basically push it into another blade on the back, which means that you get line-rate forwarding again. It also means that you have to manage this box as eight different switches.

Now, if you then look at the architecture, they call it a spine in a box, or a Clos in a box, because you see all the line cards connected in such a way that you have those full line-rate connections. So for an operating system, if you're installing that — if I go back to the previous slide, you see eight modules on the front — that means that if you, for example, run Cumulus Linux on it, you have eight instances of Cumulus Linux running on top of that. So if you're managing your network, automation is quite key here, because you don't want to manage all of those instances individually.

Now, if we then look a bit at Cumulus Linux and the software part of it: we are based on Debian Jessie, and we also use everything that comes with it. You could say that, for example, Cisco NX-OS or Arista EOS runs on top of a Linux kernel as well. That is true, but they have their own user-space applications that handle all the network functions. What we do is use the Linux kernel for all the forwarding, and what we basically added is a switching daemon. The switching daemon looks at the kernel's netlink messages and programs the hardware. Now, if you actually want to get that information into the Linux kernel, for example if you want to do routing, you can use Quagga — or actually a fork of it, which is Free Range Routing — and you can use that to set up your routing protocols: OSPF, BGP, and so on. Our customers, or anyone using it, have the freedom to choose a different routing suite. For example, if they're not happy with Quagga or if it doesn't fit their needs, they can also install BIRD. Something like that happened, for example, here at the camp: they were testing, ran into something specific, and they just had the freedom to choose another routing daemon, which means that you have that flexibility.

Now, one of the key things if you look at a network operating system these days, either from a traditional vendor or Cumulus Linux, is that you want to automate your network, simply because it's not feasible to manage all devices manually anymore. What is specific about Cumulus is that it's a regular Linux distribution, so you can use your standard tools like Ansible, Puppet, and so on, which makes it a lot easier to manage. I'll get back to that in a bit.

The same goes for monitoring. Typically, we've been monitoring networks with SNMP, but SNMP is based on ASN.1, which doesn't have all the functionality that we would like to have. Because it's a standard Linux box, you can also install the same monitoring tools that you're already using on your servers, and I think that's easier to integrate, because you can use the same configuration that you already use on your servers. What we see is that sometimes people just treat it like a server that happens to have 48 ports. You can then manage it by pushing out data with collectd to a time-series database like InfluxDB and do your monitoring that way.
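As a rough illustration of that server-style monitoring, here is a minimal sketch of shipping interface counters from a switch with collectd to an InfluxDB host; the host name, port, and file path are assumptions, not something shown in the talk.

    # Minimal sketch: export interface statistics to a collector
    # (assumes collectd is installed and InfluxDB has its collectd listener enabled)
    cat <<'EOF' | sudo tee /etc/collectd/collectd.conf.d/export.conf
    LoadPlugin interface
    LoadPlugin network
    <Plugin network>
      # hypothetical monitoring host running InfluxDB's collectd input on 25826/udp
      Server "monitor.example.net" "25826"
    </Plugin>
    EOF
    sudo systemctl restart collectd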
Now, one of the things Cumulus does in the open source world is that pretty much everything we develop is upstreamed. We upstream a lot of the development that we do on the Linux kernel. For example, VRF support simply didn't exist in the Linux kernel. There were some attempts with namespaces, but if you look at white-box switching and large-scale networking, using namespaces on your boxes doesn't give you scalability. So VRF support is in the 4.9 kernel, and I believe it's also included in the latest Ubuntu release.

Then of course Quagga and Free Range Routing: if you have followed the development of Quagga a bit, it didn't progress very well over the last few years. So Cumulus, together with a dozen other companies, decided that it was good to make a fork — something that you don't really want to do unless absolutely necessary. That fork is Free Range Routing, it includes all the routing protocols, and it's pretty good to use.

There's also ifupdown2. If you've ever worked with Debian, for example, and you do a network reload, you see a message that it might not work. That is one of the reasons we developed our own network manager, because that is something you don't want on a switch. For example, the original ifupdown doesn't have a partial reload, and on a network box you don't want to reload all the ports and affect every connection that you have.

Now, one of the things when I'm talking to people and talking to the community is that you also have to explain data center network design, because basically you're coming in with a new concept of open networking, and their last network refresh was maybe 10 years ago. So then: okay, what are you going to build, and how are you going to build your network? If there's anyone here with a network background who did any type of certification, this diagram is pretty standard. It has an access, distribution, and core layer, and it's based on a time when applications were very server-to-client-centric. If you configured something like that, your network was fully layer 2. It was based on spanning tree, or if you were lucky you had MLAG, but that was all based on proprietary protocols, and that limited scalability.

What you do these days is build a Clos network, or a spine-leaf network, and that isn't something we invented. If you look at all the vendors these days — also Cisco, Juniper, Arista — if you make a typical data center network design, you build a spine-leaf network. One of the limitations of a classical network topology is that you are limited to two cores. If you build a spine-leaf network, you typically start with a routed configuration, which means you have routed links between all your leaf switches — the top-of-rack switches — and the spines, and that means you can balance all your traffic over multiple spines; you're only limited by the number of ports that you have. So in this case, if you want to have six spines, then sure, that's perfectly possible, and you have much more bandwidth inside your data center, which you need for storage, VMs, and so on.

If you then look at that routed configuration, how do you usually build it? Between your leaves and spines you use either OSPF or BGP, although if you look at most recent configurations, BGP inside the data center is becoming the default. LinkedIn and Microsoft also wrote an RFC about that, and BGP has been proven to be pretty scalable.
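As a minimal sketch of what such a routed leaf configuration could look like in Quagga or Free Range Routing — the AS numbers, addresses, and interface names are assumptions for illustration, not taken from the talk:

    # Hypothetical leaf: one eBGP session per uplink towards the spines
    sudo vtysh <<'EOF'
    configure terminal
    router bgp 65011
     bgp router-id 10.0.0.11
     neighbor 10.255.1.1 remote-as 65020
     neighbor 10.255.2.1 remote-as 65020
     address-family ipv4 unicast
      network 10.1.11.0/24
     exit-address-family
    end
    write memory
    EOF

In a real deployment you would typically also enable BGP multipath, so traffic from the leaf is actually balanced over both uplinks.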
And one of the advantages is that you simply don't have any spanning tree in your backbone anymore. What you typically do is dual-connect every server, and MLAG is pretty standard for that, which means that from the server's perspective the two switches look like one device. You still have to manage them separately, and that is how you can provide redundancy for your servers.

Now, one problem with layer 3 networking and having routing in your backbone is that you don't have layer 2 anymore. Personally, I like that very much, because layer 2 can cause a lot of issues. The problem is that there are cases where you actually need layer 2, and one is functionality: for VMs, if you want to move a VM from one hypervisor to the other, you need to be in the same layer 2 domain. Also firewalls and the like might still want to use layer 2 protocols to do synchronization. To solve that, you typically build an overlay network. I'll go into the details later, but the typical protocol for that these days is VXLAN. There are several other overlay protocols; however, VXLAN has become the default.

Now, there is one other option. This is something that we kind of introduced, at least from a marketing perspective, but it's not really new. We call it routing to the host, and what you basically do is install that same routing daemon — in this case Free Range Routing, or Quagga — on your servers. So if all your servers run Linux, you can install Quagga there as well, and that means you don't need MLAG on your switches anymore. That can be an advantage for some people, because MLAG is still layer 2, you still have dependencies between the two boxes, you have a daemon that is synchronizing, and MLAG is also not a standard: every vendor has their own implementation, and it doesn't look like it's going to become one standard.

Now, as I said before, automation is quite important for us, and we think that's the future, because you simply don't want to configure everything manually. What you want is that when your equipment arrives at your data center, you don't have to send an expensive engineer to connect everything; you just send a ticket to remote hands to cable everything up, and from that point on everything should be automated. One of the things that helps here is a bootloader, ONIE, the Open Network Install Environment, which was developed by Cumulus, because before we started there was no bootloader, and getting an operating system installed was quite some work. ONIE is basically the PXE equivalent for switches. It's fully open source and under the umbrella of the OCP these days. One of the advantages is that the switch boots, does a DHCP request, and automatically installs your OS based on the DHCP option that it gets, and the first time the operating system boots, you can run some scripts to make sure that your configuration gets loaded.
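As a sketch of how such a zero-touch install could be wired up: ONIE can learn where to fetch the installer from DHCP. The addresses, the installer file name, and the exact DHCP option usage below are assumptions, so check the ONIE documentation before relying on this.

    # Hypothetical ISC dhcpd snippet: hand ONIE a URL for the OS installer image
    cat <<'EOF' | sudo tee -a /etc/dhcp/dhcpd.conf
    option default-url code 114 = text;
    subnet 192.0.2.0 netmask 255.255.255.0 {
      range 192.0.2.100 192.0.2.200;
      # ONIE downloads and runs this installer after it gets a lease
      option default-url "http://192.0.2.10/onie-installer-cumulus-linux.bin";
    }
    EOF
    sudo systemctl restart isc-dhcp-server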
Now, another thing for automation, and something that you want to do, is testing — and how do you actually do that? Sure, your vendor can provide you with four, six, maybe eight switches if you're lucky and you put in a big order, but typically your production network is larger than that. So what we did, since we're a Linux distribution, is release a virtual machine; you can download it separately and play with the operating system.

Just having one virtual machine is not enough if you want to test an entire network, though. So what we did is use Vagrant to define a network topology, and this is a diagram of our reference topology. You can use that to fully test whatever you're doing in your network: you can set up your BGP sessions or clone your entire configuration. Of course, you don't have hardware forwarding, but that's not really an issue if you're just testing your protocols. Obviously, this reference topology doesn't match your physical topology, so what we also did is build a topology generator: based on a diagram, you can create a new topology that matches your network. I did this myself for multiple customers; they said, okay, we want to test, and how would it look if we built our network like this — and then I just generated that topology to test and implement everything.

Now, what does that mean, and how can you actually use it to automate and orchestrate your network? As I already said, you can use your standard Linux DevOps tools to provision your network. That isn't very special these days anymore; of course, being a standard Linux operating system makes it a little bit easier, but all other vendors have either Ansible modules or Puppet modules, so it's possible there as well. The next step is that you basically want to manage your network in the same way that you are developing applications. You treat your infrastructure as code and do continuous integration, so you have a complete copy of your physical network in that virtual instance. If you have a change — you can decide for yourself whether you only do this for major changes — you first push the change through your virtual network, and if it works out well with automated checks, you deploy it to your production network.

A few other things that we are also doing: we developed a tool that we call the Prescriptive Topology Manager, and what you basically do is load a graph file in which you define all your connections. When you deploy your new network, you can automatically see whether all the cables are connected correctly. That means you don't have to send an engineer to check all the connections in your data center; you just run the daemon and it tells you whether the connections are as you expected and whether everything is connected. One of our customers even created a script so that when he deploys a new switch, the daemon automatically starts up, and if something is connected incorrectly, it automatically opens a ticket for his remote hands telling them they didn't do their work properly. That's quite nice.
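To give an idea of what the Prescriptive Topology Manager consumes, here is a sketch of such a graph file in DOT syntax; the switch and port names are assumptions for a small two-spine, two-leaf example.

    # Declare the cabling you expect; ptmd compares it with what LLDP actually sees
    cat <<'EOF' | sudo tee /etc/ptm.d/topology.dot
    graph datacenter {
      "leaf01":"swp51" -- "spine01":"swp1";
      "leaf01":"swp52" -- "spine02":"swp1";
      "leaf02":"swp51" -- "spine01":"swp2";
      "leaf02":"swp52" -- "spine02":"swp2";
    }
    EOF
    sudo systemctl restart ptmd
    ptmctl    # reports pass/fail for every expected link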
Now, one of the other things, because we also develop on the routing suite, is that we thought: okay, how can you improve your configuration? If you look at a typical routing configuration, you have to configure IPs on all your links and you have to manage all of that, which can be a lot of work. So what we did is introduce unnumbered routing protocols. If you're using Ethernet, you only need the next hop to actually forward your traffic. So we used the technologies and the standards that are already out there: if you configure an interface in Linux, you already have an IP address, which is the IPv6 link-local address. With IPv6 you have router advertisements, which means that you can advertise who you are to your neighbor, and based on that you can automatically set up either the BGP or the OSPF session. That means you have an IPv6 routing session, but there is also an RFC for announcing your IPv4 addresses over that IPv6 session and then forwarding your traffic. This means there is nothing proprietary about it; in the past there have been proprietary fabric technologies and such, which I personally prefer not to use.

If we then look at the monitoring part, which I already explained, I think it makes things much easier not to use SNMP anymore. You can get much more data from your device than is included in SNMP, for example values that aren't supported by ASN.1. That makes it much easier to integrate, because a lot of companies implement a new network but forget that they have to manage and monitor it afterwards.

Now, one of the things that I also want to explain is overlay networking and what that exactly is, because from the user's viewpoint, if you're not a networking person, most people just say: give me an IP address, I just want to reach my destination, and I don't care what you do with your network — just make sure that my packets arrive. From the network side we want a stable network, but the user doesn't really care about that. With overlay networking, and in this case VXLAN tunnels, you can make sure that you have that stable backbone with your routing protocols, and you build those tunnels over it, so that the end user still gets what they were already used to. Your server is still connected to a VLAN and has an IP address, and from the switch the traffic is hardware-accelerated into a tunnel, so if you want to push 10 gig of traffic, you don't have any performance degradation.

How does that look in terms of protocol overhead? It is quite simple: your original Ethernet frame is encapsulated in a UDP packet, sent to the destination, and then decapsulated again. I have a more complicated drawing of that as well, but for an end user it is quite easy to use. And on the network side, in your backbone, you just see UDP traffic, which is also quite easy to debug.

Now, how do you actually configure that on your network? I hope you can see the left side of this slide — probably not. What we did in ifupdown2, together with the VLAN-aware bridge that was added to the Linux kernel, means that if you've ever configured VLANs in Linux, you no longer have to configure a separate bridge for every VLAN on your device, and that makes it just very easy to manage. On that bridge, you have to define a VNI for every VLAN. That still means that if you have a lot of VNIs, you have to configure them separately, but that is something we're working on as well: to get the same model as the VLAN-aware bridge, so that you only have to define a VNI instance once.
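As a sketch of what that can look like in ifupdown2 on a leaf switch — one VLAN-aware bridge, a server-facing port, and a VNI mapped to a VLAN — where the interface names, the VLAN and VNI numbers, and the loopback address are assumptions:

    # Hypothetical /etc/network/interfaces fragment on a leaf
    cat <<'EOF' | sudo tee -a /etc/network/interfaces
    auto vni100
    iface vni100
        # VNI carried in the VXLAN header, and this switch's loopback / VTEP address
        vxlan-id 100
        vxlan-local-tunnelip 10.0.0.11
        # map the VNI to VLAN 100 on the VLAN-aware bridge
        bridge-access 100

    auto bridge
    iface bridge
        # one VLAN-aware bridge for all VLANs; swp1 faces the server
        bridge-vlan-aware yes
        bridge-ports swp1 vni100
        bridge-vids 100
    EOF
    sudo ifreload -a    # partial reload of only the changed interfaces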
Now, one issue with overlay networking is: how do you learn MAC addresses — where is everyone in the network? You can do that with a centralized controller, but that means you have a single point of failure in your network. If you look at the RFC for VXLAN, it uses multicast replication. For some reason, a lot of vendors didn't implement that at first, and a lot of users also didn't run multicast in their backbone, probably because it's a difficult technology or because they simply had no interest in it. So a lot of vendors implemented the distribution of MAC addresses and address learning through a technique called head-end replication, which basically means that if a broadcast message is received on a switch, it is replicated as unicast to all the other switches that are interested in it — and that "interested" part is usually a proprietary mechanism. The problem is that you want integration between multiple vendors, and for that you have the EVPN standard, where you basically use BGP to announce all your MAC addresses and show where every member is.

Now, how does that actually work? First, you need to know which switch is actually interested in a tunnel endpoint. For that you have a type 3 route. There are five route types to announce everything, and type 3 basically says: okay, I'm interested in these VNIs, so if there is any broadcast message, please send it my way. Just using type 3 would be enough to set up your overlay network, but it means you would run into the same problems that you have in a standard layer 2 network, because you're still sending all the broadcasts over your network, and that is something you don't want. To solve that, you have EVPN type 2 routes, and basically what you do on a switch is run a proxy ARP: it receives the broadcast message, translates it into a BGP update, advertises it to all the listeners, and doesn't forward the broadcast message. That saves a lot of broadcast traffic, and you can also do very cool stuff like an anycast gateway, to make sure that your traffic is locally routed on that switch.

Now, one important thing with that broadcast suppression is that you have to keep legacy applications in mind: if you have anything that isn't ARP or IPv6 neighbor discovery, you have to disable your broadcast suppression. Luckily, most of that traffic is either ARP or neighbor discovery, and those specific proprietary broadcasts you can usually contain to one VNI, for which you then disable the broadcast suppression.

One more thing: I said you want to be able to move VMs from one end of your network to the other. For that there is the MAC mobility extension, which basically means that if you trigger a VM move between two hypervisors, another leaf switch will learn the MAC address; there is simply a counter that increases for that specific MAC address, everyone updates their tables, and traffic is forwarded to the new destination.
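As a rough sketch of how a leaf could be told to advertise its local VNIs with EVPN in Free Range Routing, using unnumbered BGP sessions towards the spines — the AS number and interface names are assumptions:

    # Hypothetical vtysh session on a leaf switch
    sudo vtysh <<'EOF'
    configure terminal
    router bgp 65011
     neighbor swp51 interface remote-as external
     neighbor swp52 interface remote-as external
     address-family l2vpn evpn
      neighbor swp51 activate
      neighbor swp52 activate
      advertise-all-vni
     exit-address-family
    end
    write memory
    EOF

The advertise-all-vni statement is what makes the switch originate type 3 routes for the VNIs it has configured, and type 2 routes for the MAC addresses it learns, as described above.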
Now, I quickly want to show you a demo. I have started up a virtual instance with our reference topology. I hope you can see this a bit. What you basically do is start up that reference topology, log in to an out-of-band management server, and from there you can simply log in to one of the leaf switches. What you get, if you unpack a switch and install Cumulus Linux on it, is that you basically end up in Bash, and from there you start to configure. So if you want to configure a network interface, you typically edit the interfaces file and add your network configuration for each interface. Now, if I explain this to someone with a Linux background, the reaction is: yeah, okay, this is great, and I can manage everything through that.

The problem is that usually I'm going to customers and explaining how to move their network to a new solution. First I come with open networking, then they have to get used to spine-leaf topologies, which they don't know, then they have to start automating their network, and maybe even use Git — something they may not know yet. That might be a step too far for some customers, so for that we actually introduced a CLI. Yes, it's a bit backwards, because it's something you don't actually want to use, certainly not if you have a large network. So what we did is introduce a CLI that is integrated into Bash: every command starts with net, and with that you can simply configure routing and interfaces. So if I want to add a new interface and set an IP address on it, I can simply do this. At this moment it's not applied, so you first have to commit that statement before it becomes active, and I can also see all the pending configuration. In this case, it will add a network statement for that single port before I commit it. Now, one thing that another vendor, Juniper, is very famous for is their rollback feature. That is something we also have: you can commit your configuration with a confirm behind it, which means that you have to take an action after your commit statement, and if you get disconnected or lock yourself out, it automatically rolls back.

Now, as I said, this CLI is nice, but you want to automate your network. For that we use a lot of Ansible inside Cumulus, and it's getting a lot of traction in the world for automating networks. So what I'd like to show is a short demo of that as well. What I'm doing now is running a small playbook that will create a routing configuration, create bridges, and make sure that two virtual servers are connected. This will take about a minute or so. But the idea is that this playbook and all its variables are usually stored in Git, so you can actually collaborate with your colleagues and track changes, which is much better than something I still see daily: configuration in a text file that gets emailed to a colleague with no versioning, and in the end there is a typo because your colleague isn't using the version that you intended.

I'll open up the variable files in a bit, but I have to wait until it has installed Apache. In this case, you saw that the reference topology has four servers; I'm actually configuring only two of them, because a more complex configuration takes even longer. And in this case, you see that it just configured everything and pushed the configuration out. So if I log in to that switch again and use the CLI to do a show command, you simply see that the interfaces are configured, that there is a bridge configured, and it's ready for use in just under two minutes or so, which makes it much faster.

Now, there are several ways in which you can use Ansible, and there seems to be a discussion in the community about what's best for network configuration. So we actually support both. If you want to just replace your configuration files using a Jinja template, that is possible. So what we have is, in this case, let me see.
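For reference, the CLI part of the demo boils down to commands along these lines; the port and address are made up, and the exact confirm syntax is worth checking against the documentation.

    # Stage a change, review it, then apply it with an automatic rollback window
    net add interface swp3 ip address 192.0.2.1/24
    net pending                 # show the staged change before applying it
    net commit confirm 60       # apply, but roll back if the commit is not confirmed
    net show interface swp3     # verify the result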
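And the template-based approach can be sketched roughly like this: a Jinja2 source for the interfaces file plus an Ansible task that renders it onto the switch. The variable names, paths, and the handler are assumptions for illustration.

    # Hypothetical Jinja2 template for /etc/network/interfaces
    cat <<'EOF' > templates/interfaces.j2
    auto lo
    iface lo inet loopback
        address {{ loopback }}/32

    {% for port in uplinks %}
    auto {{ port }}
    iface {{ port }}
    {% endfor %}
    EOF

    # Hypothetical Ansible task that renders the template and triggers a reload
    cat <<'EOF' > tasks/interfaces.yml
    - name: render the interfaces file from the Jinja2 template
      template:
        src: interfaces.j2
        dest: /etc/network/interfaces
      # assumes a handler named "reload networking" that runs "ifreload -a"
      notify: reload networking
    EOF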
So in this case, we simply have a Jinja2 template, and from that you take an original configuration file and put some if statements and for loops in it to replace the configuration file that you already have. The thing with Jinja templates is that as your configuration grows and your network complexity grows, they can become quite complex. Now, we also have a module that talks to the daemon behind the CLI; that is an option as well. You see that from several vendors: they ship a module to configure things easily that way.

So, yeah, this is what I'd like to show today; basically, that is my presentation. If there are any questions, I'd be happy to answer them.

There's a microphone in the middle of the room. Do you have any questions? Just line up. The first question?

Hi, you've shown how many of these things depend on MLAG or LACP to distribute the traffic. Are there solutions that look after elephant flows and properly distribute them over the different uplinks that you have?

Yeah, I've gotten that question from a customer before, and there are some standards that look at it, but we haven't implemented anything.

Thank you. Any other questions? Okay, thank you. Well, then let's have another round of applause.