Okay, so: Ethernet switch support in the Linux kernel. First, who am I? I'm Alexandre Belloni. I'm an embedded Linux engineer at Bootlin. We were previously known as Free Electrons; we just changed our name for various reasons. We are focusing a lot on open source tools and kernels, so mainly the Linux kernel. I'm the maintainer of the Linux kernel RTC subsystem and also the co-maintainer of the kernel support for the Microchip, or Atmel, ARM SoCs.

So we will be talking about Ethernet switches and why we want Linux support for them. Basically, what we have now are switches that have a CPU, or at least that are connected to a CPU that may run Linux. That's why we want Linux support for them. So you have a CPU there. It is connected to the switch through a bus, which can be MMIO, PCIe, I2C, SPI or MDIO, and that is the control bus. Only control and configuration go through that bus. Then you want to be able to get Ethernet frames, so you also have some data bus that will probably go through your DDR at some point, because your CPU needs to be able to read those frames. Either your switch is able to write directly to the DDR, or maybe you have an Ethernet port that does that. It depends on how your switch is connected. But basically you want Linux support for those switches because Linux will want to configure them.

In our case we wanted a driver for the Microsemi VSC7514. So you have the SoC with all the blocks there. You can see that it has a MIPS CPU, so it's not that powerful; it's only a 500 MHz MIPS CPU. It has all the usual controllers: UART for the console and debugging, I2C, SPI, that kind of thing. It also has two MDIO controllers, which is interesting because it also has four integrated PHYs.
One of the MDIO controllers is internal to the SoC only, and the other one goes outside if you want to add more PHYs. It's a 10-port Gigabit Ethernet switch. So, yeah, it's not that fast, but for a small embedded board it's quite fine.

Currently, support is done through an SDK. That SDK is provided by Microsemi. They are using userspace I/O, so UIO, and everything is done in user space. Basically they map the whole register range for that switch and just share it with user space, and their user space SDK or application does all the configuration.

A switch usually includes a few features. Not every switch includes all of them, but usually that's what you see. There will be bridging, or switching, obviously, since it's a switch. Sometimes they will support STP, MAC filtering, IGMP snooping, and VLAN tagging or untagging. Some switches do not support VLANs; ours does, so we did support for that.

So how do you want to support your switch? Instead of using that SDK or vendor-specific tools, we want to be able to use the standard Linux tools: ip or ifconfig for interfaces, ip, bridge or brctl for bridging, and the usual Linux bonding for port trunks, so that's the link aggregation stuff. In Linux, the switch ports are represented as Linux network interfaces. They are just simple interfaces, and you use the usual bridge commands to add your separate interfaces to the bridge. Then your switch will be able to accelerate what Linux can do in software. So the first step is to actually have those separate interfaces, with which you can do whatever you want in software, and then, on top of that, you add all the hardware offloading you can do in your driver.

We used switchdev, which is a Linux framework to offload features to your device, right? So what is switchdev? Switchdev is not exactly a subsystem. It's not using the device driver model.
So, the usual Linux device driver model. Switchdev doesn't handle a device that is a switch, right? It is more a framework that can be used by your net devices to implement all the offloading operations. Basically you get some switchdev ops that you attach to the net device you register. And then you manipulate some switchdev objects that are provided to you by the network core: the VLANs, the MDBs and that kind of thing, right? We'll see in a bit more detail what VLANs and MDBs are and how everything works.

Like I was saying, the first step is to actually support your front ports, so all your Ethernet ports, and have them separated; for each of them you want to create a new net device. The main issue we had there is that the VSC7514 doesn't have an Ethernet controller. It's a switch that is meant to switch, but it's not meant to have an Ethernet controller carrying frames to or from that switch, right? There is, however, an interface on that switch that allows you to extract and inject frames. So you can extract frames from the switch to the CPU and inject them from the CPU to the switch. We are using that, and in fact most of our initial configuration was basically doing that: telling the switch not to switch, and forwarding frames to the CPU. That's the main issue, because by default the switch is configured as a switch. It will switch and not forward anything to the CPU. So you have a lot to do when probing your ports just to make it not switch, right?

Then it's kind of the usual stuff. You want to register all your net devices. We are registering one net device per port, so we create a struct net_device. Then we add the netdev ops, ethtool ops and switchdev ops, right?
And finally, one thing that is really important is to add the interface MAC address to the switch MAC table. Our switch MAC table is able to say: any frame going to that particular Ethernet address goes to the CPU port, so it has to be extracted and sent to the CPU, right? If necessary, we are also looking up the PHY so that later we can attach it to the device. We do all that at probe time.

So, opening your device. One of the callbacks you have to implement for your net device is .ndo_open. That is basically the one that is called when you bring the interface up, for example with ifconfig sw0p0 up or ip link set dev sw0p0 up. At that time you want to enable frame reception on the port, and what we also did is enable the auto-learning of MAC addresses, so the switch will automatically add new MAC addresses seen on that port. That's also when we start the PHY, so we are fully ready to receive frames.

The other callback that you will want to implement is .ndo_start_xmit, which is the one called when you are sending frames on your interface. That's where the frame injection from the CPU is done. We have that CPU port and we are injecting frames. We are also configuring those frames to be forwarded by the correct switch port, right? Because we are just sending the frames to the switch, and the switch then has to select the correct port. So we have a small header that specifies what to do with that frame. Basically it has something like 12 bits to specify on which port the frame has to be transmitted.

On our particular switch, the frames are transmitted using PIO, through only one register, and we got that working. As you can imagine, that's quite slow. The throughput is like 20 megabits per second. So it's not fast, but that's the simplest interface.
And we don't really care about that, because it's not meant to be fast anyway, right? We have two other ways to transmit frames to the switch; those are using DMA and will be quite a bit faster. We didn't implement that yet, but it's on the to-do list, so we will do it at some point.

So, bridging, because that's what you want to do with your switch, right? When you do software bridging, you have something like that. You have your switch, and it has four ports. Let's say all four ports are bridged. When a packet comes in on port zero, it goes up to the kernel, so the kernel has to read it. At some point, the bridge core selects the port on which it has to be forwarded. Let's say it's port three, so then you copy it back to the switch. You don't really want to do that, because it gets quite slow and you are consuming the bandwidth between your switch and your kernel.

But there is something you still want to keep working, and that is the basic transfer. When a packet comes in on port two and it is destined for sw0p2, so for your network interface, you still want it to go through the kernel and up to user space.

If you implement hardware offloading, it looks like that. The switch does not forward the frames to the kernel; it just forwards everything to port three and that's it. The kernel does not even know that a frame came through the switch, and everything is fine because you have offloaded everything. This other case, like I was saying, still has to work, because otherwise your Linux interface is not working anymore.

So how do you set up a bridge from user space? You do something like that. You add a new bridge with ip link add name br0 type bridge, which creates a new bridge named br0, and then you add all your interfaces to that bridge.
So you just do ip link set dev sw0p0 master br0, and so on. You add all your ports, or only the ports that you want in your bridge, right?

So how do you handle bridging? If your network interfaces are working, bridging will be working, but it will be software bridging. You want to be able to offload everything to the switch. How do you do that? Handling the addition and removal of an interface on a particular bridge is done through callbacks. From your switch driver, you register a particular callback, and you do that using register_netdevice_notifier. So you register a netdevice notifier. The event you will get is NETDEV_CHANGEUPPER, which means: okay, the upper interface containing your interface just changed. Then you get a structure that is a struct netdev_notifier_changeupper_info. That is the info, and it tells you what actually happened and what you need to react to, right?

There is a helper function for you, netif_is_bridge_master. You use it on the upper device from info and it tells you whether the upper is a bridge master or not. If it is a bridge master, then it's a bridge and you are actually adding your interface to a bridge, which is exactly what you want. And if you want to know whether the interface is being added or removed, you can check info->linking. If it's one, it's linking, so the interface is being added to the bridge. If it's zero, the interface is being removed from the bridge. Your driver just has to react to that. I will not go into too much detail on what we are actually doing on our switch, because it's really specific to that switch, but that's the kind of info you can react to.

So the other thing you want to do is handle the forwarding database, right?
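As a rough illustration of the bridge join and leave handling just described (before turning to the forwarding database), here is a minimal user-space model in Python. It is not kernel code: SwitchModel, ChangeUpperInfo and netdevice_event are hypothetical names, the event constant is modelled as a string, and the upper_is_bridge field stands in for what netif_is_bridge_master would report.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the notifier event; the real constant is an integer.
NETDEV_CHANGEUPPER = "NETDEV_CHANGEUPPER"


@dataclass
class ChangeUpperInfo:
    upper_name: str        # name of the upper device, e.g. "br0"
    upper_is_bridge: bool  # what netif_is_bridge_master() would report
    linking: bool          # True: port is being added, False: removed


@dataclass
class SwitchModel:
    # bridge name -> set of port names that are bridged together in hardware
    bridges: dict = field(default_factory=dict)

    def netdevice_event(self, port, event, info):
        """Return True if the modelled driver reacted to the event."""
        if event != NETDEV_CHANGEUPPER or not info.upper_is_bridge:
            return False
        members = self.bridges.setdefault(info.upper_name, set())
        if info.linking:
            members.add(port)      # port joins the bridge
        else:
            members.discard(port)  # port leaves the bridge
        return True


switch = SwitchModel()
switch.netdevice_event("sw0p0", NETDEV_CHANGEUPPER, ChangeUpperInfo("br0", True, True))
switch.netdevice_event("sw0p1", NETDEV_CHANGEUPPER, ChangeUpperInfo("br0", True, True))
switch.netdevice_event("sw0p1", NETDEV_CHANGEUPPER, ChangeUpperInfo("br0", True, False))
print(sorted(switch.bridges["br0"]))
```

In a real driver the membership set would translate into switch-specific register writes, which the talk deliberately leaves out.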
The forwarding database, or FDB, is sometimes called the MAC table or that kind of thing. It's a database, a list of the MAC addresses that are present on each port. What does it say? It says: that particular MAC address is on port zero, that other MAC address is on port two, that other one is on port three. It allows your switch to know on which port it has to forward frames, right? What is really important here is that the kernel view of the FDB and the switch view of the FDB have to be in sync. Otherwise the kernel will not know why it sees frames, or why frames are not being forwarded, or that kind of thing. So it's really important to keep them in sync.

You have a user space command to dump the current bridge FDB table: bridge fdb show. You can also add static entries to your FDB. If you want to say that a particular MAC address is always present on a given port, you can do that, right?

On the kernel side, like I was saying, the kernel FDB and the switch FDB have to be kept in sync, and that is handled using the .ndo_fdb callbacks: you have FDB add, FDB del and FDB dump. Those are actually part of the net device operations structure, so they are not specific to switchdev; they can also be used on simple interfaces that are not switches, to filter MAC addresses or that kind of thing.

When we first started to work on that project, we were using kernel version 4.12 and we were using the switchdev port FDB functions that were provided by the switchdev framework. Unfortunately they were removed in v4.14, for good reasons, so we are not able to use them anymore. And that's basically the kernel version we were targeting, so we had to reimplement those. FDB add and FDB del are quite simple to implement. Basically you get a MAC entry and you just have to add it to or delete it from your MAC table.
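To make the add, del and dump operations concrete, here is a toy Python model of a MAC table. It only sketches the bookkeeping: the class and method names are made up for illustration, and a real driver would program the hardware table and, for dump, build netlink messages instead of returning a list.

```python
class MacTable:
    """Toy model of a switch MAC table: MAC address -> port name."""

    def __init__(self):
        self._entries = {}

    def fdb_add(self, mac, port):
        # Adding an address that is already known simply moves it to the new port.
        self._entries[mac.lower()] = port

    def fdb_del(self, mac, port):
        # Only delete the entry if it is really on the given port.
        if self._entries.get(mac.lower()) != port:
            return False
        del self._entries[mac.lower()]
        return True

    def fdb_dump(self, port):
        # One item per MAC address known on that port, in sorted order.
        return sorted(mac for mac, owner in self._entries.items() if owner == port)


table = MacTable()
table.fdb_add("00:11:22:33:44:55", "sw0p0")
table.fdb_add("00:11:22:33:44:66", "sw0p2")
print(table.fdb_dump("sw0p0"))
```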
It's usually quite simple to do. The main issue is FDB dump, because it's more complicated. It has to handle all the netlink messaging between kernel space and user space. For each MAC address that is part of your MAC table, you have to create a new netlink message and send it to user space. What we did is basically take it from DSA. I will have a look at DSA later, but DSA is the Distributed Switch Architecture. It's something that uses switchdev to handle switches too.

So why did we have to implement those? It's because our hardware is not able to send interrupts when it learns a new MAC address or when it forgets one. So it's not able to send events to the switchdev core, and we have to do that to maintain the FDB table. The others, actually the two others, because there are not many switchdev drivers in the kernel, have an interrupt when a MAC address is added to or removed from the FDB table. So they can call a notifier that notifies the switchdev core that a new MAC address is present or that a MAC address just disappeared. So you have two ways of maintaining that sync between the kernel and the switch. If you have interrupts, use call_switchdev_notifiers. If you don't, then you have to implement your ndo_fdb callbacks.

Something else that is important is to manage the ageing, because if your switch maintains the MAC table then it will age the MAC addresses itself, and you want that to happen on your switch rather than in Linux, right? So you want to be able to set the ageing time. You have two ways of doing that from user space, with ip link or with brctl, and for that the bridge core calls one particular callback that is part of the switchdev ops: switchdev_port_attr_set.
It sets attributes on your port and it looks like that, right? It takes a net device. That net device is the one attached to your port, the one that is managing your particular port. And it gives you a struct switchdev_attr, so the switchdev attribute. You have the switchdev_attr structure just afterwards. It can be related to many different attributes. As you can see, there is a union, and in our case we get the ageing time from that union, right?

The last argument you get in port attr set is a struct switchdev_trans. This is meant to be a transactional function, so it will be called twice. The first time it is called for the prepare phase of your transaction, and then it is called for the commit of your transaction. That allows you to change the configuration atomically, right? If you have a look there, you get the ID. In the case of the ageing time, you get SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME, and then you get your ageing time from the union. So, like I was saying, it is called twice: once for prepare and once for commit. Usually what you see in the drivers is that they only check for prepare, with switchdev_trans_ph_prepare, because if it's not that one, then it's obviously commit, right?

Something that you probably also want to support on your bridge is STP. You don't have to support it, but it's much better if you do. So what does it mean to support STP on your switch? It's basically reacting to state changes that come from either the kernel or a user space STP daemon.
Basically the STP daemon tries to figure out how all your switches are connected. Maybe they have a circular connection, which is something you don't want, or maybe you have multiple connections to one particular switch as a fallback, and you want only one of them enabled at a particular time. That's what STP provides. Your switch will not do those STP calculations by itself; it just has to react to the STP state changes, right?

To do that, we again use the switchdev_port_attr_set callback, and this time the ID is SWITCHDEV_ATTR_ID_PORT_STP_STATE. If you go back to the union, you get a u8 stp_state, and the values for that state are BR_STATE_DISABLED, BR_STATE_LISTENING, BR_STATE_LEARNING, BR_STATE_FORWARDING and BR_STATE_BLOCKING. When the port is disabled, it's completely disabled and you are not forwarding anything to the CPU. When it is listening, you want to forward frames to the CPU so that the STP frames reach your STP daemon or the kernel. Then, in the learning state, you learn new MAC addresses on your port. And finally the forwarding state is the most important one: it's basically when you tell your port, okay, now you can forward frames that are coming from other ports.

Your driver has to react to those states. So, yeah, do whatever your switch has to do, set whatever registers have to be set so that your switch does the correct thing. For our switch it was quite simple: we actually had a register for each port with states that almost exactly match the states from Linux, so it was easy to support.
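The two-phase attribute call and the STP state handling described above can be sketched together as a small Python model. This is only an illustration under stated assumptions: Port, HW_SETTINGS and set_stp_state are hypothetical names, the three per-port booleans are an invented stand-in for the real per-port register, and the BLOCKING row is not described in the talk (it follows the usual STP behaviour of neither learning nor forwarding). The numeric values mirror the Linux bridge BR_STATE_* constants.

```python
from enum import IntEnum


class BrState(IntEnum):
    """STP port states, numbered like the Linux bridge BR_STATE_* values."""
    DISABLED = 0
    LISTENING = 1
    LEARNING = 2
    FORWARDING = 3
    BLOCKING = 4


# state -> (copy STP frames to CPU, learn MAC addresses, forward frames)
HW_SETTINGS = {
    BrState.DISABLED: (False, False, False),
    BrState.BLOCKING: (True, False, False),   # assumption, not from the talk
    BrState.LISTENING: (True, False, False),
    BrState.LEARNING: (True, True, False),
    BrState.FORWARDING: (True, True, True),
}


class Port:
    """Hypothetical per-port hardware settings."""

    def __init__(self, name):
        self.name = name
        self.bpdu_to_cpu = self.learning = self.forwarding = False

    def attr_set_stp_state(self, state, prepare):
        """Called twice per change: prepare=True first, then prepare=False."""
        settings = HW_SETTINGS[BrState(state)]  # ValueError on an unknown state
        if prepare:
            return  # prepare phase: validate only, change nothing
        self.bpdu_to_cpu, self.learning, self.forwarding = settings


def set_stp_state(port, state):
    # Model of the caller: run the prepare phase, then the commit phase.
    port.attr_set_stp_state(state, prepare=True)
    port.attr_set_stp_state(state, prepare=False)


port = Port("sw0p0")
set_stp_state(port, BrState.LEARNING)
print(port.bpdu_to_cpu, port.learning, port.forwarding)
```

Because validation happens in the prepare phase, a rejected request leaves the port settings untouched, which is the point of splitting the call in two.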
Link aggregation is where you use two interfaces on one side as if they were one interface, and on the other side you have the same kind of aggregation happening, so you get either a redundant link or a faster link, because you can actually aggregate the bandwidth. How is it set up in Linux? You have multiple ways of doing that. You have the legacy way using the bonding module, but it's quite annoying because it only allows you to have one interface. Or you have the new way, using ip link add name, whatever name you want, type bond, which creates a new bonding interface. Link aggregation, bonding and trunking are all names you will see for that feature. Then you add your interfaces to that particular aggregated link, right? You just say that the master is your aggregated link.

Setting that up from your driver is almost the same as for a bridge. You get a NETDEV_CHANGEUPPER event like for the bridge, and this time, instead of netif_is_bridge_master, you have the netif_is_lag_master function that tells you: okay, that particular upper device is a bond, so you have to add your port to that particular bond, right? Again, info->linking tells you whether this is an interface addition or an interface removal.

Something else you probably want to support is IGMP snooping. What happens without IGMP snooping? IGMP is there to support multicast addresses, right? When doing multicast without IGMP snooping, when your switch gets a multicast frame on one port it just forwards it to all the ports. That may not be what you want, because in that case, for example, port two doesn't have any member for that multicast address, and it's a waste to forward frames to that particular port.
So you actually want to do IGMP snooping so that Linux can install new FDB entries, or new MDBs, in the MAC table, and you get to the state where frames for one particular multicast address are only forwarded to the ports where you have members of that multicast address, or multicast group, right? Like I was saying, Linux can install those MDBs inside the switch to avoid flooding multicast traffic on all ports.

For that, you want to forward IGMP packets to the CPU. You have a helper for that: netdev_for_each_mc_addr. For a particular net device, it gives you all the multicast addresses for that device. You have to install all those addresses in your FDB, and you just want to copy all the frames going to those addresses to your CPU.

The other thing you have to do is let the Linux kernel add MDBs to your MAC table. For that you use switchdev_port_obj_add. That is a callback registered in the switchdev ops that are attached to your net device. The callback looks like that. Again you have a net device, the net device for your switch port. Then as a parameter you get a struct switchdev_obj, and it's again a transaction: the third parameter is the transaction, which can be prepare or commit.

The switchdev_obj looks like that. You will see that there is not much in it, and you will not be able to do much with the switchdev_obj itself; you actually have to cast it using that kind of macro there. SWITCHDEV_OBJ_PORT_MDB uses container_of to give you the actual MDB object that you want, so in that case a struct switchdev_obj_port_mdb. If you look at that one, it contains the MAC address and it also contains a VLAN ID. That's what you want to add to your MAC table for that particular port.
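The effect of installing MDB entries can be sketched with a small Python model of the forwarding decision. The class and method names are hypothetical; the only things taken from the talk are that an MDB entry carries a multicast MAC address and a VLAN ID for a given port, and that without an entry the frame is flooded to all ports.

```python
class MulticastTable:
    """Toy model: (group MAC, VLAN ID) -> set of member ports."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mdb = {}

    def mdb_add(self, port, mac, vid):
        # What a port obj add for an MDB object boils down to in this model.
        self.mdb.setdefault((mac.lower(), vid), set()).add(port)

    def egress_ports(self, ingress, mac, vid):
        members = self.mdb.get((mac.lower(), vid))
        if members is None:
            members = self.ports  # unknown group: flood to every port
        return sorted(members - {ingress})


GROUP = "01:00:5e:01:02:03"
mcast = MulticastTable(["sw0p0", "sw0p1", "sw0p2", "sw0p3"])
print(mcast.egress_ports("sw0p0", GROUP, 1))  # flooded, no entry yet
mcast.mdb_add("sw0p1", GROUP, 1)
mcast.mdb_add("sw0p3", GROUP, 1)
print(mcast.egress_ports("sw0p0", GROUP, 1))  # members only
```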
So, like I was saying, the object ID will be SWITCHDEV_OBJ_ID_PORT_MDB. You cast it, and then you get mdb->addr and mdb->vid. That's what you want to add to your MAC table. At that point you install all your MDBs, and you have offloaded IGMP snooping to your switch.

One of the other features our switch supports is VLAN filtering. That one is more complicated. This is just an example; it's basically the setup we are using to test our features, and it doesn't make a lot of sense otherwise, right? So how do you configure that? You create your bridge. You say: okay, my bridge is able to do VLAN filtering, I want to enable that. You add your interfaces to the bridge and then you add some VLANs using the bridge command. So you add VLANs on each interface. In some cases we are just using simple VLANs, where incoming and outgoing traffic keep their VLAN tags. And in some cases we just remove the tags on ingress and egress. That's where you want to use pvid untagged, right? You have an internal VLAN inside your switch, but it doesn't go outside.

In that case, it's like with the MDBs. Instead of an MDB, we get a VLAN object. We use the port obj add callback again. The object ID this time is SWITCHDEV_OBJ_ID_PORT_VLAN. You can cast it to a VLAN object using that macro, and the VLAN object looks like that. Basically it has a few flags, and you have a first VLAN ID and an ending VLAN ID. Quite often you only get one VLAN ID, so begin and end are the same, but it's also possible to get a whole range of VLAN IDs, so you have to iterate over all those IDs. Regarding the flags, they are a combination of all those flags there. That's where you get the PVID and untagged flags, and that's where you want to properly configure your switch depending on the flags that are set, right? Our particular switch is able to strip VLAN tags.
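The iteration over the VLAN range and the handling of the PVID and untagged flags can be sketched as follows. This is a Python model rather than driver code: PortVlanObj and PortVlanConfig are hypothetical names, the fields mirror the flags, first ID and last ID mentioned above, and the flag bits use the values of the Linux bridge BRIDGE_VLAN_INFO_PVID and BRIDGE_VLAN_INFO_UNTAGGED constants.

```python
from dataclasses import dataclass

# Flag bits with the same values as the Linux bridge BRIDGE_VLAN_INFO_* flags.
BRIDGE_VLAN_INFO_PVID = 1 << 1
BRIDGE_VLAN_INFO_UNTAGGED = 1 << 2


@dataclass
class PortVlanObj:
    flags: int
    vid_begin: int
    vid_end: int


class PortVlanConfig:
    """Hypothetical per-port VLAN state a driver would program into the switch."""

    def __init__(self):
        self.member_vids = set()
        self.untagged_vids = set()  # tag is stripped on egress for these
        self.pvid = None            # VLAN given to untagged ingress frames

    def vlan_add(self, obj):
        # Often begin == end, but a whole range is possible, so iterate.
        for vid in range(obj.vid_begin, obj.vid_end + 1):
            self.member_vids.add(vid)
            if obj.flags & BRIDGE_VLAN_INFO_UNTAGGED:
                self.untagged_vids.add(vid)
            else:
                self.untagged_vids.discard(vid)
            if obj.flags & BRIDGE_VLAN_INFO_PVID:
                self.pvid = vid


cfg = PortVlanConfig()
cfg.vlan_add(PortVlanObj(flags=0, vid_begin=10, vid_end=12))
cfg.vlan_add(PortVlanObj(BRIDGE_VLAN_INFO_PVID | BRIDGE_VLAN_INFO_UNTAGGED, 100, 100))
print(sorted(cfg.member_vids), cfg.pvid, sorted(cfg.untagged_vids))
```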
On the VLAN tag side, it can also handle up to three VLAN tags on the same Ethernet frame. That's not something we support now; we only support one VLAN tag, but the switch does everything in hardware. Adding or removing the tag, depending on how we configured it, is completely offloaded. So it's really efficient on that side.

Finally, I want to talk about DSA. DSA is when you have something like that. You have the control bus there that goes to the CPU, and then your data flows through an Ethernet controller. It doesn't have to, but basically all the drivers that are using DSA are connected like that. Your control bus can be SPI, MDIO, I2C, whatever. And between your switch and your CPU you are using Ethernet, which is both nice and not so nice, because your switch wastes one Ethernet port to talk to your CPU, right? That's the first kind of switch support we had in Linux. It basically comes from all the OpenWrt-supported switches, right? The ones that you get in the small VLAN switches. That's the kind of thing they are doing.

What you can see there is that you can have a DSA port that talks to another switch, and you can have many switches like that, right? One of the main points of DSA is that you can cascade multiple switches like that, which is really nice, especially since you are losing one port: you can add other ports by adding another switch. So, like I was saying, DSA is the Distributed Switch Architecture. It handles chaining switches, and it also handles vendor-specific switch tagging protocols. Because you are connected through Ethernet there, you have to add a particular switch tag so that your frames are forwarded to the correct port, right?
When the CPU wants to send a particular packet to, let's say, sw0p2, it adds a particular tag inside the Ethernet frame so that when the switch gets that frame, it just forwards it to the correct port. DSA, for its part, integrates nicely with the device model: it provides you with a switch device that has switch ops, right? It also has a well-defined device tree binding, so that's something that is quite standard. On our side, we had to define our own bindings. I tried to stay as close as possible to what DSA is doing. It was not that difficult, but that's something I had to do. So all the device tree parsing that could have been done by DSA, I did in my own driver.

So, DSA versus switchdev. DSA is actually using switchdev, right? So it's not completely different. But when writing a driver, you have to select which framework you want to use. Do you want to use DSA? Do you want to use switchdev? You have multiple questions to answer for that. Is the switch connected to the CPU through an Ethernet interface? If that is the case, most likely you will be using DSA, because it most likely means that you have a switch tag, right? If you have switch tags with a vendor-specific protocol and that kind of thing, then maybe you want to use DSA. The other important question is: can the interface that connects your switch to your CPU absorb all the traffic from the switch? Basically, the switches handled with DSA are participating in the traffic, right? Whereas the switches handled by switchdev are more like top-of-the-rack switches. They are not participating in the traffic: traffic goes through the switch but not to the CPU.
And that's basically one of the reasons why we used switchdev, because we will be switching up to 10 gigabits and our CPU can only handle 20 megabits. So we don't want to participate in the traffic. We want to offload as much as possible, right?

The other challenge we had writing the driver is that the switch can be used in different ways. The way we used it, it was connected to the MIPS CPU, and we had Linux running on that MIPS CPU. It was connected using MMIO, so directly on the SoC bus, but it can also be connected to an external CPU using PCIe or MDIO. And basically we use device tree to describe that switch, right? We used everything that is provided by device tree, like how you get your PHY, that kind of thing. And when you want to connect that switch using PCIe, especially on x86, it gets much more difficult to use device tree. The current solution is to have an MFD driver registering all the other drivers as platform devices. That's kind of the comeback of the board files, but that's what we have to do.

The other challenge is that all the registers are packed in the register space. They actually have multiple switches that are using exactly the same IPs but with a different number of ports. Because you have a different number of ports, you have a different number of registers, and they packed everything instead of leaving some space in between registers. So we have to handle all the offsets depending on the number of ports we have, which is really annoying, right?

So, yeah, the next steps: sending patches upstream. That was supposed to be done this week, but most likely it will be done next week or something like that. It's really something we are actively working on. The goal of that project is to stop using that vendor-specific SDK and start using a real Linux switchdev driver.
DMA is on the to-do list. Like I was saying, using DMA we will get much faster transfers between our switch and the CPU, but the CPU port itself is still limited to one gigabit, so we will never get the full bandwidth from the switch. But that's fine. PTP: the switch supports PTP. It also supports TSN and that kind of thing, so we will definitely want to work on that. It also supports some quality of service, SyncE and that kind of stuff. And finally, we have an issue with promiscuous mode support, because we are copying too many frames when activating promiscuous mode on the bridge. We want to stop doing that, and it's quite difficult to do with the switch hardware itself right now. So that's something we need to work on.

And that was it. I hope you learned something. Do you have any questions? I have a mic there. A question there.

So a lot of these switch chips are very complicated. They have all sorts of additional features. Sometimes they have layer 3 routing functionality, or support for access control lists, and so on. Is that something that the switchdev framework will support eventually, or does support?

Yeah, well, I didn't really look at much more in the switchdev framework yet, but basically it doesn't support much more than that, because, for example, for the objects you only have MDBs and VLANs right now. So you almost have the whole API described there, right? But I guess it will be supported at some point. DSA is most likely to be used on embedded switches, small switches, whereas switchdev is driven by Mellanox and Netronome, who are making really huge top-of-the-rack switches. Like, if you are not switching 40 gigabits per port, then you are not switching anything, right? So they have a lot of features and they are really working actively on it. I guess we will see a lot more features supported in the future. So, a question there, Thomas.
With your aggregation support, were you able to use all types of bonds, such as active-passive and 802.3ad, and did the traffic have to go into user space or were you able to keep it all in the switch?

Yeah, so our switch was using 802, I don't remember which one, but the classic one, right? It actually supports four types of aggregation, but the one it supports best is mode 4 for Linux, so the default one, I guess. I don't remember exactly, but everything is kept inside the switch. The switch itself generates the Ethernet frames to keep the link alive and that kind of stuff. So everything happens on the switch, right? And it tells the CPU whether the link is down or not. Which is nice.

So, a question? Nope. Yeah, so three minutes, and it's the closing game, so let's have fun. Thank you.