Hello, everyone. I think we can start. So, welcome. Let me first introduce myself. I'm Maxime, I work at Bootlin, and I mainly work on networking drivers. I've worked on network PHY drivers and network MAC drivers, especially on PPv2, which is a network controller found on Marvell platforms. During my work, I have implemented support for classification in PPv2. So, today I'm going to describe a bit what classification is, what is already available in the kernel in terms of software classification, what operations we can offload to the hardware and why that is useful, and give you a quick example of how this is really implemented on the hardware side, inside a network controller. Nowadays, this is something we are starting to see more and more in the embedded world. Classification and offloading classification is not something new; it has been around for a long time in the server world, with big companies and manufacturers such as Mellanox or Intel. And it used to be done by firmware, so it was completely hidden from the Linux world. Today, we are starting to see components where we have to interact with this from the Linux side. So, the goal of this talk is to see what kind of challenges we have to deal with when using these technologies. First of all, let me give you a quick reminder of what classification is. I am going to focus on the ingress side of the traffic. When we talk about traffic, we have either egress traffic, which is traffic going out of the machine we are working on, or ingress traffic, which is inbound traffic. We are going to focus on inbound traffic. So, a quick reminder: what happens when we receive a packet from the outside world? Most of the time, we have a networking PHY. The PHY is in charge of decoding the physical signals that transport the information we want; it can be fiber optics, it can be a copper cable. The PHY will decode that and transmit it to the MAC, which in embedded systems is most of the time located inside the system-on-chip. It can also be on PCI cards and so on. In the case of offloading, the offloaded operations, so the things the hardware will do for us, are most of the time done before the MAC transfers the packet to memory. So this happens before the DMA step when receiving a packet. When the MAC is done receiving the packet and copying it to memory, it will raise an interrupt to signal the CPU that we have an incoming packet that needs to be processed. A quick zoom inside the MAC: in our case, we have a packet processor. It is not on the schematic, but it is a part of the MAC dedicated to packet processing. It will look inside the headers of the packet and see if it can do some operations to offload some work from the software world. And once the packet is received, as I said, it is placed in memory using DMA. We have a mechanism of receive queues, also called receive rings in some cases. On very simple network controllers, we only have one receive queue, but on complex controllers, we have multiple receive queues. This can be useful because we can pin interrupts to these receive queues. The main interest in our situation is that we can make sure that one given core of our CPU will be processing packets from a given receive queue. This will help us spread the traffic across all of our CPUs, instead of only using one and relying on software mechanisms to do the load balancing afterwards.
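As a side note, pinning a receive queue's interrupt to a given CPU is typically done from user space like this. This is a minimal sketch: the interface name and IRQ numbers are made up, check /proc/interrupts on your own system for the real ones.

```sh
# Find the per-queue interrupts of the interface (names vary per driver).
grep eth0 /proc/interrupts
#  34: ... eth0-rxq0
#  35: ... eth0-rxq1

# Pin receive queue 0's interrupt to CPU0 and queue 1's to CPU1
# (the value written is a hexadecimal CPU bitmask).
echo 1 > /proc/irq/34/smp_affinity
echo 2 > /proc/irq/35/smp_affinity
```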
So, I'm going to talk about this a bit later; there is a mechanism called RSS that is useful in this scenario. Once the packet has arrived and we got the interrupt, the software world takes control. We have an interrupt handler inside our driver. In networking, most of the work is done in softirq context, so the hardware interrupt handler will try to do as little work as possible; it will just signal that a packet has come in. Then we have a mechanism called NAPI that does some interrupt coalescing. The main idea is that when you receive one packet, most of the time you will receive a bunch of them in a row. So instead of having one interrupt that needs to be handled per incoming packet, we group them together and use a polling mechanism to read these packets, instead of working in a purely interrupt-driven way. This helps with the processing throughput. After that, we have the step that will be the main focus, which is the TC subsystem, which allows us to perform very specific actions on our packet based on the content of the headers. The normal route, without TC, is to go through the handlers for all of the protocols present in our packet. If you're familiar with the way a typical network packet is built, we have several layers encapsulated within each other. The first layer we have to worry about is layer 2; in our case, this is the Ethernet layer. In that layer, we have information about the MAC addresses and the VLANs, for example. We have generic handlers inside the kernel to deal with the VLANs and so on. Once we've dealt with layer 2, we can deal with layer 3, which is most of the time IPv4 or IPv6. We can do some routing here, based on the IP addresses present in the header. Then we deal with layer 4; the most common protocols are UDP and TCP, so we have information about the destination and source ports. At that point, when we handle layer 4, we try to associate the destination IP address and the TCP or UDP destination port with a socket that was opened by user space. From that point, we simply give the content of the payload to user space, and user space deals with all of the other layers that might be present in the payload. So the kernel only deals with layers 2, 3 and 4 in the most common cases. Dealing with all of these layers is great, we have a very generic networking stack, but if we want to do some very specific operations, such as dropping packets that are not interesting, relying on these handlers can cost a lot in terms of CPU time. So we have a hook, which is the TC subsystem, that allows us to do some filtering before going through all of these layers. And this is what we are trying to offload in our scenario: all of these pre-processing steps before the layer 2, 3, 4 handling can be offloaded to hardware. To do that, we have to look at the content of the headers inside our packets. This is not an easy task, because there are a lot of existing protocols, and that makes looking for specific information difficult, because we cannot simply look at a fixed offset inside the array containing the bytes of our packet. An example is when we have VLAN tags. A VLAN tag enlarges the layer 2 header, and it therefore shifts all of the other headers by 4 bytes.
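Just to make the TC hook concrete, here is a minimal, purely software sketch of attaching the ingress hook and a filter to an interface; eth0 is an example interface, and the matchall filter with a drop action is only there for illustration.

```sh
# Attach the ingress/egress classifier hook to the interface.
tc qdisc add dev eth0 clsact

# Purely software example: drop every ingress packet with the
# matchall filter, before the normal protocol handlers run.
tc filter add dev eth0 ingress matchall action drop

# List the filters attached on ingress.
tc filter show dev eth0 ingress
```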
So, if we want to get the TCP destination port, we first have to know whether we have a VLAN tag or not before looking up that information. This step of extracting all of the relevant information from the headers is called dissecting the packet. Here are several examples of various kinds of packets. They all use the same protocols, except for the IPv6 one, but as you can see, with just small differences we get a very different layout of the packet. Classification, in our case, is simply the operation of identifying the packets we are interested in, to perform some operations such as dropping them very early on or rerouting them to another interface. Most of the time, we talk about flows when we deal with classification. A flow, in our case, is a group of packets that share the same values for some attributes. Common flows are 2-tuple and 5-tuple flows. 2-tuple flows are groups of packets that share the same source and destination IP addresses, so packets that belong to the same flow come from the same place and go to the same place. For 5-tuples, we also look at the TCP or UDP ports, as well as the layer 4 protocol: are we using TCP or UDP? Packets that share the same five values for these attributes will most likely go to the same socket in user space. So dealing with these flows is an easier way to handle traffic steering and control. We have two main interfaces to perform classification from user space, well, two main interfaces that allow for hardware offloading; there are several more. The most powerful one is TC flower. The tc tool is used for traffic control; it's a very powerful tool that allows you to affect both ingress and egress traffic. It's used to limit the rate of some flows, for example. So you can do very specific stuff such as limiting the rate of a given flow, and you can prioritize traffic with this tool: make sure, for example, that traffic going to your web server is handled with higher priority than all other kinds of traffic. With tc, you have a wide variety of what we call filters. Filters are basically classifiers. Most of them have a software implementation, and the flower filter is special because it allows having both a software implementation and a hardware implementation of the filter. The TC flower filter decides by itself whether it will offload the rules to the hardware or do the work in software. So, this is an example of how to use the TC flower filter to perform a very simple case of classification, which would be dropping all traffic going to port 80 on a machine. But you can do much, much more complex things and build some kind of tree of actions and classifications: you can group the traffic with a filter, apply actions on this group, then re-filter behind it and apply a more specific kind of filtering. So TC is very, very powerful, but the drawback is that it can be very difficult to express such complex rules when we want to offload something to the hardware. The other main interface is ethtool. ethtool is a tool that allows you to directly interact with your networking driver, and with ethtool you can also express rules that you want to be offloaded into your controller. With ethtool, you can only do hardware classification, not software. You don't have a tree-based structure here.
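To give an idea, a port 80 drop rule with TC flower could look roughly like this. This is just a sketch: eth0 is an example interface, and the skip_sw flag is only there to force the rule into hardware; leave it out to let flower decide by itself.

```sh
# Attach the classifier hook if it is not there already.
tc qdisc add dev eth0 clsact

# Drop all ingress IPv4 TCP traffic with destination port 80,
# asking for the rule to be installed in hardware only (skip_sw).
tc filter add dev eth0 ingress protocol ip flower \
        ip_proto tcp dst_port 80 skip_sw action drop
```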
You simply have a big table of rules, and the uppermost rule always takes precedence. There is ongoing work on a third interface that supports hardware offloading, which is netfilter. A big issue with these two interfaces, TC flower and ethtool, was that their internal representations were different when you were writing a networking driver. You had one set of callbacks for configuring TC flower and another set for ethtool, and you basically had to duplicate the work each time you wanted to support ethtool or TC flower. Now that they want to introduce a third way to offload traffic classification, with netfilter, they have decided to unify this internal representation into one, which is called the flow offload infrastructure. This is ongoing work that is currently being upstreamed, so soon there will be a third entry here, for netfilter. So, why do we want to offload classification? There are pros and cons. The pro, of course, is that you reduce your CPU load. Having your CPU unpack the headers of your packet, look at the different attributes and decide whether or not it should do something specific with the packet takes CPU time. In embedded systems, the CPU is valuable, and you also have to keep in mind that we can be dealing with 10 gigabit per second interfaces, and the CPU will not be able to perform classification at that throughput. In some cases, if you want to reach what we call line rate, you have to offload most of the work to the hardware. As I said, this can also be useful to spread the traffic across the CPUs, because the CPU that handles the incoming interrupt will be the one doing all the work inside the kernel. By default, this would be only one CPU if you only have one receive queue available, and you have to use classification to spread your traffic to all of your receive queues and all of your CPUs. So this is an easy way to make use of all of your CPU cores. You can also do some switching with that: there is infrastructure such as switchdev, which uses these kinds of offloads to implement switching. So this is basically why you would use hardware offloading, but there are some cons. The big issue is that if you decide to drop some traffic inside your MAC, your kernel will never know that this packet came in; it will never see it. Your counters will be off, because they won't be kept up to date. If you want to detect new flows, in some cases you have to make sure that the first packet of a flow reaches the CPU before deciding what to do with that flow, whether you want to drop it or not. So you have to make sure that you are okay with the fact that your system may miss some incoming packets, and miss the information that something came in at all. The hardware design of these kinds of packet processors is quite similar to what you find in software. First of all, you have to extract all the information about what is inside the headers. This is done by what we call the parser; it is the equivalent of the dissecting step in software. The parser is a component inside your packet processor that extracts all of this information. As I said, this is not straightforward to do, because you have to look inside each and every layer to find the various offsets of the attributes you are interested in. Once you have parsed your packet, you can then use some classification engines to decide what you will do with it.
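For comparison, a similar drop rule expressed through the ethtool rule table described above might look like this. Again a sketch: eth0 and the addresses are examples, and not every driver implements these hooks.

```sh
# Make sure n-tuple filtering (hardware flow steering) is enabled.
ethtool -K eth0 ntuple on

# Drop all IPv4 TCP traffic with destination port 80 in hardware
# (action -1 means "drop" in the ethtool rule table).
ethtool -N eth0 flow-type tcp4 dst-port 80 action -1

# Steer another flow to receive queue 2 instead.
ethtool -N eth0 flow-type udp4 dst-ip 192.168.0.1 action 2

# Show the rules currently installed.
ethtool -n eth0
```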
You can have a policer: the policer is in charge of limiting the throughput, if you want that. And then you have the classic steps of doing the DMA transfer and queuing to your receive rings. The parser is very interesting because you have to use a special kind of memory to implement it. Keep in mind that you want to be dealing with 10 gigabit per second traffic or higher, and to do so, most of the time we use what we call a TCAM. In the same way that you have SRAM or DRAM, the TCAM is another kind of memory. It is a memory that is addressed by value, and it is not a binary memory as you are used to: it is a ternary memory. You will be matching patterns inside a key. The key is simply some bytes that you extract from your packet header, and you try to match patterns inside that key. What you can match is either a 0, a 1, or what we call an X, which means any value. This is why it is a ternary memory: you have 0, 1, or whatever. The TCAM takes a lot of space on the silicon die, so it is very expensive to put in the MAC, and you have very limited space here. Once you have matched something, the TCAM returns the index at which the match was found, and this index is then used to perform a standard SRAM lookup inside another memory. That memory contains all the attributes associated with this match. For example, if you want to detect whether or not you have a VLAN tag inside your L2 header, you will match the specific header value that corresponds to "I have a VLAN tag", and you will associate with that match the information "I have a VLAN tag"; this is a bit that you simply flip to indicate, okay, I have a VLAN tag. In the parser, this TCAM plus SRAM lookup is something that you iterate for every packet. You will have a first match that detects whether or not you have a VLAN tag. Then you will have a second match that detects whether you are using IPv4, IPv6 or something else. You will have other matches to detect whether or not you are dealing with fragmented traffic. Then you will try to detect whether you are using TCP or UDP. And, as I said, depending on the result of each match, your header size will change, so the offset at which you will find the next interesting attribute depends on the result of the match. At each match, you update the offset for the next match, and in that way you can basically crawl through the headers and accumulate new information about the content of the packet each time, only the header. With this kind of iteration, you can dissect your packet at 10 gigabits per second with these technologies. This is all very internal to your MAC; at that step, the software has not come into play yet, this all happens before the interrupt is fired. Then, using this information, the classifier decides what to do with the packet. Most of the time, a hardware classifier will also have a lot of tables that you have to configure, and inside these tables, you will try to match some specific information.
So, now that you have identified where you can find the TCP destination port, the TCP source port, the destination IP and the source IP, you can build keys in a reliable way, regardless of specific details such as whether or not there is a VLAN tag. In the classification step, we extract all of the relevant information and try to match it against some rules. At that step, we decide whether or not we are interested in traffic going to port 80, as in my earlier example. You have several engines that are used for classification. You have engines based on TCAM technology, so basically the same thing as the parser. You have engines that use hashing mechanisms; in that case you get a fuzzy match, because since you use a hash of information from your header, you won't be 100% sure that this is the flow you are interested in, but this can be useful for something such as spreading traffic across all of the CPUs with RSS. And you can have classification engines inside the hardware that basically have a small CPU dedicated to classification, where you write rules with if, then and else constructs. Then, based on the results of the various engines, you decide what you will, in the end, do with your packet. Do you drop it? Do you forward it to the software? Do you redirect it to another interface? This is what we are trying to decide at that step. Let me just dig a bit deeper into the RSS thing. RSS, too, is not new, but I consider it a special case of hardware classification because we still need to extract information from the headers. RSS stands for Receive Side Scaling, and it allows us to spread the traffic across multiple CPUs. We cannot simply say, okay, I have an incoming packet, I will route it to CPU 0, then the next one to CPU 1 and the next one to CPU 2, in a round-robin kind of way. You cannot do that, because if the packets belong to the same flow, you end up with packets that need to be processed by the same user space process being handled by a lot of different CPUs, and then you would have to do some locking to make sure that all the packets are, in the end, dealt with in the correct order by user space. So what you do instead is extract the information about which flow the packet belongs to, and then spread the traffic depending on that particular flow. You associate an identifier with any given flow using a hashing function, and then, based on this hash, you assign a receive queue to the packet. This guarantees that all the packets from the same flow are always handled by the same CPU. So you don't have any reordering issues to worry about, and your cache locality issues are dealt with. This really makes for great performance improvements for routing, for example. So, this is how you configure RSS; this is done with ethtool. In PPv2, we find all of these technologies. PPv2 was the base for the parser and classifier architecture shown in these slides, but this is very similar in other networking controllers; in very high-end controllers, these are the same technologies, but at a much bigger scale. In PPv2, we have a TCAM parser that has 256 entries, so you can specify 256 different matches on the headers. You have 512 instructions for classification, and you have four different classification engines.
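To give an idea of the RSS configuration side, here is roughly what it can look like with ethtool. A sketch: eth0 is an example interface, and the exact options supported depend on the driver.

```sh
# Show the current RSS hash key and indirection table.
ethtool -x eth0

# Spread the indirection table evenly over the first 4 receive queues.
ethtool -X eth0 equal 4

# Hash UDP/IPv4 flows on source/destination IP and source/destination port
# (s = src IP, d = dst IP, f = src port, n = dst port).
ethtool -N eth0 rx-flow-hash udp4 sdfn
```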
An interesting thing here is that the C3 and C4 engines are so complex that we just cannot support them with the restricted APIs we offer to user space. This is something that is common: we have to limit what we do with the classifier, because it is a very generic way of doing things, and it is too powerful to be entirely under the control of user space for now. So, in our situation with PPv2, we can do pretty much everything. We can drop the traffic at this stage, we can redirect it to other interfaces, and we can choose which receive queue the flow will go to. What I call steering to an RSS table means that you can target a specific flow and spread it across a lot of CPUs. In that case, you would do that with 2-tuple flows: you can say, okay, all of the traffic towards a given IP address will be handled by all of the CPUs, and all the rest will be handled by only one. So you can prioritize things. One drawback is that all of these resources are shared between all of the physical ports on a machine. When you are doing a tc or ethtool configuration, you specify which interface it applies to, and in our case we have to manually separate all of these flows between the different physical interfaces inside the driver, because these resources, the TCAM and the classifier, take so much space that they are only present once on the die, and you have to make sure that when you specify a rule on a given port, it won't drop traffic on the other interfaces. This can be problematic. So, in PPv2, we support RSS, we support steering on 2-tuples and 5-tuples, and we also support steering on the VLAN tag. This allows us to perform quality-of-service classification: we can prioritize one VLAN over the others, for example, making sure that traffic destined to one VLAN is handled on all CPUs and all the rest on just one. We have MAC address and VLAN filtering; this is the step where we detect whether or not the VLAN tag present in the packet is interesting for us. If we receive a packet that doesn't have the correct destination MAC address, you would normally drop it because it's not interesting for us; doing that in software is a loss of CPU time, so we can do it in hardware. The same goes for the VLAN tag. And there is no firmware involved in any of this configuration; this is all done in the driver. So, if you want to see how you configure a TCAM, you can have a look at the PPv2 driver. And as I said, we only use two of the engines, because the other ones are way too complex to implement. So, hopefully, you learned a thing or two. You have to take a lot of things into account when you do classification offloading. The most obvious one is that, in this situation, the hardware will be doing stuff behind the CPU's back, so if you have a bug or a misconfiguration, it can be very hard to troubleshoot. There are ongoing efforts to report all of the statistics that the hardware has to user space; there are discussions about that. Most of the time, the performance improvement makes it worth it, especially in the embedded world. And we are starting to see interesting devices where this is not done by a firmware, which, in my opinion, is a big win. So, if you have any questions on that, I would be happy to answer.
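On the troubleshooting point above, the usual way to peek at driver and hardware statistics from user space today is sketched below; eth0 is an example interface, and whether classifier drop counters show up there depends entirely on the driver.

```sh
# Dump driver-specific (often hardware) statistics; counter names vary per driver.
ethtool -S eth0

# List the flow steering rules currently programmed into the hardware.
ethtool -n eth0
```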
Yes. Hi. Thank you for your talk. You mentioned there are several user space interfaces to configure the filtering. Are these all handled by one and the same callback in the driver, or do you have to implement several callbacks? So, nowadays, you have to implement several callbacks. You have the TC hooks, you have the ethtool hooks. And this is tricky, because you have to more or less ignore the fact that you have different sets of callbacks in your driver and assume that your user will only be using one of them, because you can get contradictory rules from TC and ethtool. For now, this is something that is being worked on with the flow offload infrastructure work, and hopefully one day this will be a unified interface. But this is done by the infrastructure in the kernel? The cleanup that is being worked on is in the kernel infrastructure, not in your driver? Yeah. So, for now, we have to deal with several sets of callbacks. Thank you. Any other question? In that case, I will ask a beginner's question. If I use VMware or other virtualization, I often end up with promiscuous interfaces. Could that help? If you use, sorry? Bridging into a VM. Yeah, could this technology help to not end up with a promiscuous interface on the host? So, there are ways to use classification offloading where you have some VMs without using promiscuous mode, with virtio and so on. I'll follow up on that question. If you have these DSA switches and you configure them in bridge mode, then the CPU interface ends up in promiscuous mode, and that's not very nice. Could that somehow be fixed by the classification, maybe on the CPU interface side? Yeah. So, I don't know in that case for switches. Well, the problem is the Linux bridge code, which switches the CPU interface into promiscuous mode. Yeah. Thank you for your talk, it was really, really interesting. I wanted to ask a question about hardware counters, because the packets filtered out at the hardware level, you can never see them. Is there counter information in the hardware about that, and is there some standardized way to access it? So, for now, what you do is use ethtool to ask your network driver to interrogate the hardware about the counter values. This is the interface you have to use now, and it is not always implemented by your driver. As I said, there is ongoing work to try to unify that, especially with ethtool using netlink now; this will simplify a lot of stuff. So, yeah, this is something being worked on, because it's not helpful for troubleshooting if you cannot even see that packets arrived. Thank you. Well, I think we are out of time anyway. So, thank you.