Okay, so I'm an engineer working at Red Hat on networking, various stuff; it's quite a dynamic job, working on something else every year. But today I'm going to present some interesting stuff that has actually been in the kernel for years and will probably be there for years to come. Before we start digging through the topic, we first have to cover some general principles behind it. I hope all of you are familiar with them, so let's get through this quickly. To really get traffic in Linux, you need something called a network interface. A network interface is something that has a name, it has an ID, and you can list it with the command ip a, or maybe ifconfig, but that's the old one that you should not be using anymore. We have two main kinds of network interfaces: physical and virtual. Physical ones represent a NIC, the physical networking card, the slot you have on your laptop that you can plug a cable into, or a Wi-Fi interface. A virtual one is something that does not exist physically but is still useful, for example a VLAN interface, or a software bridge, or whatever else. Now, interfaces can be stacked, which means you can put interfaces on top of each other. Not all interfaces, but for example a bridge has its ports. So if I have two ports in my server and I put them into a software bridge, or a bond, or whatever, I now have interfaces that are stacked on top of each other. I can put a VLAN on top, and so on. So ip a is the command to list network interfaces; here is an example output. You can see eth0 as an example interface; of course there will probably be more interfaces listed, this is just one of them. Now let's say you executed ip a and you don't see your interface. What's going on? Okay, just maybe it's hiding. Where can it be hiding? The first place it can be is a namespace. What is a namespace, or network namespace, to be precise? It's a kind of networking container. It's like spawning another networking stack, including everything it contains. You can really think of it as a container for networking. In fact, if you're using containers, be it Docker or anything else, they are actually using namespaces as part of their setup to isolate themselves from the rest of the containers, networking-wise. What is important for us is that a single interface is in exactly one network namespace. It could be the root one, the one that is started by default when the computer boots, or it can be another one that was created later. So how to list network namespaces? There is the ip netns command. If you run it, it will just output the existing network namespaces, one per line. You can then use the ip a command to explore a particular network namespace by giving its name to the -n parameter. Let's see an example. This would be an interface in the mynetns namespace. As you see, the output is really the same as we saw before, but now we're listing that particular namespace. Again, one interface cannot be in two different namespaces. You can? Yes, you can move it, but we don't have time to cover all of that here. Sorry. Can you please repeat the question for the recording? Oh, yeah. Thank you. So the question was that you can move interfaces between namespaces.
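Just to make this concrete, here is a minimal sketch of exploring namespaces; the namespace name myns is made up, use whatever ip netns list gives you:

  $ ip netns list                    # list existing network namespaces, one per line
  $ ip netns add myns                # create a new one, for experimenting
  $ ip -n myns a                     # list the interfaces inside that namespace
  $ ip link set eth0 netns myns      # move eth0 into it; it disappears from the root namespace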
We have another option to list all network namespaces together, including all the interfaces they contain, and that is the plotnetcfg tool. I had a talk five years ago about this tool here at DevConf, so if you're interested, watch it on YouTube. Okay. So let's say you explored all network namespaces and you still don't see your interface. What's going on? Well, maybe, just maybe, it doesn't exist. How could that happen? Well, it's because no driver in the kernel claims that interface. Why could that be the case? Maybe you don't have a driver for your networking card, or it's not loaded, or it was unloaded. Or maybe it is loaded, but the interface is taken by something else. What could that be? There are a few options. First, it could be something called DPDK, the Data Plane Development Kit; I never remember the acronym exactly. It's a set of user-space drivers and libraries. That means the kernel is not in charge of the interface anymore; a user-space application or program is. We can see that with the dpdk-devbind --status command, and here is an example. We see that this particular interface, this is the PCI ID, the number of the card on the PCI bus, is actually taken by DPDK. If this command returns "not found" for you, you probably don't have DPDK installed and you might be safe. Another option could be virtualization. That's the cool thing nowadays; everybody is using virtualization. It has some interesting features that are perhaps not that widely used, but they are used in some specific cases. One of those is called PCI passthrough, which means that instead of emulating devices for a virtual machine, a PCI device is actually assigned to the virtual machine. The virtual machine controls that single PCI device by itself, without any intervention from the host kernel. I will not tell you a command to list that, because we have several virtualization solutions and they are all different. So if you suspect this is the case, just look up the configuration of your preferred virtualization solution. Although there's one command that might help you, and that is lspci. lspci lists all PCI devices that the system sees, including those that are assigned to passthrough or to DPDK or whatever. Okay, so let's say we were successful in finding our interface. We know what its name is, we know where it is. By the way, one more thing: there is a huge tree mounted under /sys which gives you a view into the kernel internals. It takes a bit of skill to go through it; it's not an easy thing to browse or walk through, but you can get used to it. It also contains the mapping from a PCI ID to a particular network interface, so this might be of additional help to you. I will not get into details there. So let's say we have our interface now. What happens now? Where are the packets going? What can happen to them? Now, what I will talk about today is really a simplified view, so those of you who are skilled in networking will have to excuse me. I'm making some intentional simplifications in order for this talk to be more digestible. I actually went into more detail two years ago about how packets flow through the kernel, so if you are interested in a still simplified but more detailed talk about that, then, again, watch YouTube.
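To give you an idea, something like this; the PCI address is just an example, and depending on packaging the script may be called dpdk-devbind or dpdk-devbind.py:

  $ dpdk-devbind.py --status                     # which devices are bound to DPDK vs. kernel drivers
  $ ls /sys/bus/pci/devices/0000:02:00.0/net     # PCI ID -> interface name, if a kernel driver owns it
  $ ls -l /sys/class/net/eth0/device             # the reverse mapping, interface name -> PCI device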
So we're not going into the packet flow through the kernel today. We instead focus on those points where packets can be stolen, redirected, whatever. Here you have a simplified receive path; RX means receive in networking terminology. On the receive path, there's a NIC, the network interface controller; that's the hardware card you have. The packets flow from there to the driver in the kernel. Then packets are passed to something called traffic control, which is a feature of the kernel that we will talk about in a while. Then packets flow through the stack of interfaces, if there is one. Then finally they reach the TCP/IP stack in the kernel, and then they reach a socket; that means your application. So what can go wrong on this long path? We will see. We also have the opposite direction, that is the transmit path. It's basically the same in the opposite direction: socket, TCP/IP stack, stacked interfaces (actually, I could insert them here as well, but they are not that important for us on the transmit path), traffic control, driver, the card. So let's go through them one by one, starting with the receive path. The first thing is the hardware, your networking card. It has some interesting features that you might not be aware of. First, a lot of, maybe even most, modern networking cards actually have an internal bridge, meaning a switch chip, switching packets. Why? Well, maybe they have multiple ports, so you can plug multiple cables in. Or maybe they don't, but they can be presented to the computer as several distinct physical cards. The latter case is called SR-IOV, and it allows the hardware, the NIC, to present itself as several PCI devices to the operating system. And there is the internal bridge, which decides which of these PCI devices a packet should flow to. How can we find out? Well, it's lspci again; we already talked about it, and it can show us. Here I see this Ethernet controller actually has two virtual functions. Now, how the bridge inside the networking card is configured is really out of scope for this talk, but it's basically a bridge, so if you have some experience with top-of-rack switches, those devices that you plug into a network and plug cables into, it's really similar. You can think of your networking card as a physical bridge attached to the network, with multiple cables going to your computer. So what might be happening if you don't see your traffic? Maybe there are virtual functions, the SR-IOV devices, configured, and maybe they are getting your traffic. So check whether all the virtual functions are down, or remember passthrough: the typical use case for virtual functions is passing them through to virtual machines. But okay, let's say we got that out of the picture. What else do we have in the hardware? Again, hardware, especially networking hardware, is pretty smart nowadays; it can do a lot of things. It contains internal flow tables and filters and all of that stuff, so that it receives only the traffic it needs to. This is usually not a problem; unless the kernel developers screwed up, this should just work for you, hopefully. What we are more interested in is hardware drops. The hardware, the card, can indeed drop packets, for example because its queues are full, or there is some other kind of error. The good news is that we can see this.
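A quick way to check for virtual functions, as a sketch; eth0 stands in for whatever your physical interface is called, and the sriov_numvfs file only exists on SR-IOV-capable cards:

  $ lspci | grep -i ethernet                         # physical and virtual functions both show up here
  $ cat /sys/class/net/eth0/device/sriov_numvfs      # how many VFs are currently configured
  $ ip link show eth0                                # on a PF, configured VFs are listed as "vf 0 ...", "vf 1 ..."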
The ethtool -S command is our friend here; it will actually ask the driver to query the hardware and give us some interesting statistics. Here is an example. You see the card reports that it received 13 frames that did not have a valid CRC checksum, so they got dropped before they even reached the driver. The full output is very long, you can try it on your laptop if you want, and there are a lot more statistics than that. It depends on the driver; they are not the same for all drivers, or for all NICs. Okay, so now we explored the hardware, we looked at the statistics, we were watching them (the watch tool is probably our friend here), and now let's go further, into the driver. The driver may be dropping packets too. Why? For example, because of memory pressure; we may be out of memory. Or a failed DMA transfer is another option. That's more common on the transmit side, but yeah, it happens; various things can happen, like buffers that could not be mapped and so on. How can we find out? We have the ip command with the -s parameter. So let's run ip -s a, or ip -s l, it doesn't really matter, to see the statistics of the driver. Here it is, and I see I have 10 packets classified as errors, and 5 as dropped. Errors are mostly errors in the packet itself; it couldn't be parsed for whatever reason. Dropped are real drops, so out-of-memory conditions or similar things. Again, the difference between ethtool and ip: ethtool queries the hardware, while ip shows the driver's software statistics. Mostly, anyway; there are a few drivers that do it differently. Now, okay, the driver successfully received and processed the packet. What happens now? Well, it hands it over to the kernel, right? Well, not really. Before that happens, there's a thing called XDP, the express data path. This is a mechanism that allows users or applications to upload their own BPF program. BPF is a kind of virtual machine running in the kernel, so they just take a program, compile it, upload it to the kernel, and it is run for incoming packets. Those programs are quite powerful; they can do a lot of things, including, as you guessed, dropping the packet. If they return the XDP_DROP return code, the packet is just dropped. They can also redirect the packet to a different interface; that would mean the packets are not appearing where you expect them, so this is of interest to us as well. Or they can even modify the packet, which means they can, for example, change its type from IPv4 to IPv6. Not that anyone does that, but it's possible. Now, I have to remind you, this is before the packet is actually processed by the kernel. So when such a program finishes, the kernel thinks the packet that was modified by XDP is the packet that was received from the wire. So if we're not getting the packets we expect, maybe it's because they were modified. How can you find out? There is the bpftool command, which you hopefully all have installed on your machines; most recent distributions have it. It recently got a net subcommand, so if you have an older distribution you might not have bpftool net. If you do, you see this: our eth0 interface (the two in parentheses is the interface index, the number of the interface) has an XDP program loaded with this ID; that's an internal ID used to identify this particular XDP program. As you see, bpftool net also lists some other BPF programs. So this is useful; we'll get to that later.
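Roughly like this, as a sketch; the program ID 42 is just whatever bpftool net printed for you:

  $ bpftool net show dev eth0          # XDP and TC BPF programs attached to eth0
  $ bpftool prog dump xlated id 42     # dump the translated BPF instructions of that program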
Okay, now if you have an older distribution, or just want everything in one place, ip a is your friend, because it lists the XDP program as well. So let's say we don't have... oh, one more thing. If you want to see what the XDP program actually does, that's a much more complex thing to do. You can use bpftool to dump the program, so you can dump the instructions of the program and use your disassembly skills to figure it out, or just find the program that is uploading it, and so on; up to you. I just pointed you to the place where it happens, or may happen. Okay, now the packet is handed to the kernel, and one of the first things the kernel does is pass it through the traffic control subsystem. This is controlled by a tool called tc. I talked about this tool three years ago here at DevConf, so if you're interested in the details of how it works and how it can be configured, you can watch that talk. The tc tool is quite powerful; it can do a lot of things. On the receive path, what is of interest to us is something called policing, which is basically a fancy word for applying filters to traffic: matching traffic by certain rules and applying actions to the packets that matched. There is a huge variety of actions that can be applied, and some of those are of course interesting to us. I would name drop and mirred as two prominent examples. Drop is dropping packets, as you can probably guess, and mirred is redirecting, or possibly copying, packets to a different interface, so it can be stealing your packets and redirecting them as well. There's also a bpf action, which allows executing a BPF program after the packet is matched, so as a result of the filter (or even as the filter itself, by the way, but that's not that interesting to us at this moment), and the BPF program can of course drop the packet too, so the bpf action is interesting to us as well. Now, there is a long command that we can use to see all filters configured on a given interface: tc filter show dev and the interface. I will not get into decrypting this somewhat long output; again, if you're interested in what all those words and prefs and columns mean, watch my talk from three years ago. What is of interest now is the action drop here. Here I see that all IPv6 traffic is actually dropped. At this point, I also remind you: if we see a bpf action or a bpf filter, we can use the bpftool net command to list all TC BPF programs as well; that might be of help. Now, let's assume that we ruled out tc, and so the packet continues flowing through the kernel. The next thing it encounters, or can encounter, is VLANs. The packet can have a VLAN tag in it, and we can have a VLAN interface configured, in which case the packet is not received on our expected interface but is instead redirected by the kernel to the VLAN interface. Again, ip a is our friend. In the example I will show, I'm adding the -d option, which lists more details. It's not necessary, but especially in the case of a VLAN interface it shows some useful extra information. What is important here is the at sign and the eth0 after it. This means this is a VLAN interface actually attached to this physical interface, and with this, all packets that have a VLAN header with this VLAN ID are sent to this interface. The -d option lets me see this line; it would not be shown without it. That's why I specified it. So if there's a VLAN interface, maybe your packets are appearing on that VLAN interface.
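If you want to see how such a rule comes to exist, here is a sketch that reproduces the IPv6 drop from the example; not something you'd want on a production box:

  $ tc qdisc add dev eth0 clsact                                       # attach the ingress/egress hook
  $ tc filter add dev eth0 ingress protocol ipv6 matchall action drop  # drop all IPv6 on receive
  $ tc filter show dev eth0 ingress                                    # and this is how you spot it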
Beware: VLAN interfaces can be in different namespaces, even in a different namespace than the parent interface. So the fact that you do not see any VLAN interfaces does not mean that packets with a certain VLAN ID are not going elsewhere. So watch out for your other namespaces, your Docker containers, and all of that. How can I find out, if I have a VLAN interface in another namespace, which parent interface it belongs to? If I execute the ip a command, again with -n to see everything in that namespace, plus -d to see more details, suddenly there's no eth0 here. Why? Because in this namespace there's no eth0 interface, remember? It's in a different namespace, so ip does not know the name. It shows just a generic identifier, meaning this is the interface with index 2 in this namespace. Namespaces have numbers as well as names; ip netns will tell you which numbers correspond to which namespaces, so you can find it that way. That's VLANs. We have more stacked interfaces than just VLANs. We have, for example, software bridges, or bonding, teaming, Open vSwitch, other stuff. What does that mean? If an interface has a master, it means that all traffic received on that interface goes to the master. It is not received on the interface itself. And mind you, remember the flow: this happens before the TCP/IP stack is taken into account, before it is even executed. That means that if you have any IP addresses set on your interface, it is completely irrelevant. They are not consulted, not even looked at. So if, for example, I have eth0, it is added to a bridge, and I have an IP address on eth0: well, bad luck. All packets go to the bridge first and are received on the bridge interface. How can we find out whether our interface has a master? Well, ip a again. Here, eth0 is connected to bridge0. So there will be another interface called bridge0, index 15 actually, and bridge0 is the interface the packets will be received on. So we figured it out; we know what our topology is, which interfaces are masters of which, whatever. Let's move further. The next opportunity to drop packets is something you all know, and that's the firewall, finally. You see, we finally reached the firewall after I've been talking for half an hour, and we've covered, I don't know, a dozen different places where packets can be dropped for whatever reason. So, finally, the firewall. Well, actually we have three firewalls in the running kernel right now: iptables, nftables, and bpfilter. Kind of, because that last one does not really work yet. Let's go through them one by one. iptables, the good old stuff, or the old stuff, rather. There are rules that you can configure, of course, and that you can list. iptables -L shows you all the rules that are configured; it looks like this. And you see, I have a rule that matches all packets and drops them. So maybe, just maybe, or rather probably, this is the problem. Firewall rules can be quite complex. There are multiple chains, rules can jump between them, and so on, and so on; untangling them sometimes takes some effort. Good luck with that. But at least we know something is going on. There's one thing that can help us, and that's counters. If I add the -v option to iptables, it shows some more details, including counters. So now I know this rule has already matched three packets.
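Two quick checks for this, as a sketch; bridge0 is just the name from my example:

  $ ip link show eth0        # look for "master bridge0" in the output
  $ bridge link show         # lists all bridge ports and which bridge each belongs to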
So three packets got dropped, and actually zero packets were accepted, as expected. A few more good shots with iptables. First, you should pay attention to the chain policy. This thing here. If there is DROP here, it means all packets that did not match any rule will get dropped. ACCEPT obviously means accepted. So even if I have no rules at all but the default policy is DROP, I will still not receive anything. Also, there's not just the one table I showed you; there are actually multiple tables, five I think, that are consulted at various stages of packet processing by the IP stack. I can specify the particular table I want to see with -t. They are filter, nat, mangle, raw, and security; just look at them all. Or I can use iptables-save to see everything at once. That might come in handy: although its original purpose is machine processing, its output is actually quite readable by humans too. nftables, that's the new stuff, the new firewall. The basic concepts are really similar: we have tables, we have chains, and we have rules. The tables play the same role as with iptables, so there should be no surprise here. I can list all the tables that are currently configured, because unlike iptables, nftables does not have a fixed set of tables, so only some tables might be configured at a given moment. I can see them with this command, and I can explore individual tables with nft list table and the table name. Or I can use one command, nft list ruleset, which shows everything in a nicely structured format. So this is an example: table filter, chain input, and here I am dropping all packets. Now you might be asking: okay, that's nice, but how can I see the stats? The answer is, you cannot, unless it was configured that way. With nftables, counters are not present by default, but you can configure them: you can modify the rules to add counters and then watch them. Note that iptables and nftables are not mutually exclusive. They can coexist, and both can be applied, so look for both. That was actually an intentional decision, to allow a smooth transition. Okay, then there is bpfilter, which is new, experimental stuff. It tries to translate iptables rules into BPF. But in fact, it translates them to XDP. So it's not really at the firewall level of the TCP/IP stack where I listed it; that was incorrect. It is executed at the XDP level. So nothing new to see here: we were already looking for XDP programs, remember? So we would see it there. Now, you may be thinking: okay, I have these cool tools, I have Wireshark, I have tcpdump, I can see all packets, right? Well, not really. The first thing to know is: where does tcpdump really sit? At what point in the pipeline I showed do tcpdump, or Wireshark, or other tools take the packets? And the answer is: here, at this point. Right before traffic control. What does that mean? It means that packets that are dropped from that point on are seen by tcpdump; packets that are dropped before this point are not. In particular, tcpdump obviously does not see packets dropped by hardware. That's obvious, because not even the operating system, no software at all, sees them. It does not see driver drops either. So if there is memory pressure or something, tcpdump does not see those packets.
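For the counters, something like this; a sketch assuming your ruleset already has an inet table called filter with an input chain, as in the example:

  $ nft list ruleset                                 # see the whole ruleset, nicely structured
  $ nft add rule inet filter input counter drop      # a drop rule with a counter attached
  $ nft list chain inet filter input                 # the counter's packet/byte counts show up here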
More importantly, it does not see packets dropped by XDP, because XDP is executed at the driver level, before the packet is handed over to the kernel. I will just let that sink in a bit: XDP programs can do things that tcpdump does not see. Question for the audience: if XDP modifies a packet, what does tcpdump see? The modified packet, yeah, correct. We also have one other tool that is very, very relevant to this talk, and that's called dropwatch. dropwatch operates at a different level than tcpdump. It does not capture packets; rather, it watches all the places in the kernel where packets can be dropped. So whenever any part of the kernel calls the internal function that actually destroys a packet, dropwatch sees that, and it periodically reports it to you. How does that look? Like this. It shows that in the past time interval, one packet was dropped at this kernel address. Obviously, it requires some knowledge of what the different functions in the kernel mean. Yeah. How can you start dropwatch? dropwatch -l kas; this parameter means it should resolve the numbers, the addresses, into symbol names. You will need the kernel debuginfo package installed for this to work. Then you type start, and it starts reporting drops. We actually have more tools that we can use to see even more. Oh, okay, one more thing about dropwatch. It sees almost everything. I'm saying almost. Why? Because, obviously, it cannot see the hardware drops; we already said that. But it also does not see XDP drops. Why? Because XDP does not really operate on packets. If you watched my talk two years ago: it does not operate on sk_buffs. That means those packets are not freed using the usual kernel functions, so dropwatch does not see them. XDP drops, dropwatch does not see. We can use perf. This is an almighty tool that can watch anything that's going on in the kernel. If we use it right, we can even see XDP drops. But this is really something for experts. I encourage all of you to look into it, but if you're scared, don't worry; it is really, really complex. So that was the receive path. Let's look at the opposite direction, transmit. I will be really quick here, because it's basically the same as the receive path, mostly. At the TCP/IP level, there are iptables and nftables that can drop packets. The commands are the same as we have shown. There is no bpfilter at this point, because there is no XDP support for outgoing packets yet. So that's TCP/IP. Then there is traffic control. Traffic control, again: filters, actions, packet drops or mirrors. There's one more thing in the traffic control layer on the transmit path, and that's usually called shaping, which means a packet can be delayed on its way out. That might be of interest to you as well, although, as I said, the packet is just delayed, so it should not affect you that much. Anyway, this command shows the so-called qdiscs, which are responsible for delaying packets and things like that. If you see netem there, be especially careful: that's really powerful stuff that can do a lot of nasty things to your packets. Then there's the driver, which can drop packets because of memory pressure, because of a failed DMA transfer, whatever. And there is the hardware, which can again drop packets for its own reasons. We can use ip -s for the driver drops and ethtool -S for the hardware drops. As I said, there's no XDP on egress yet, so transmit is actually a bit easier on us.
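For the shaping side, a small sketch; again eth0 is illustrative, and you probably don't want to leave this configured:

  $ tc qdisc show dev eth0                                 # is there a netem or other shaping qdisc here?
  $ tc qdisc add dev eth0 root netem delay 200ms loss 1%   # delay every outgoing packet, drop 1% of them
  $ tc qdisc del dev eth0 root                             # back to the default qdisc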
Now, one last thing: packet capture, tcpdump or Wireshark, on the transmit path. If you've ever started it, you've probably noticed that it also shows outgoing packets. So where does it sit? It's right before the driver, so it's after traffic control. It's really symmetrical to the receive path. But that means it doesn't see drops here and here: packets dropped by the firewall, packets dropped by tc, tcpdump does not see on transmit. And with that, we have, like, five minutes? Five, ten minutes for questions. Yes? What about ebtables? Okay, that's a good question. The question was: what about ebtables? ebtables is a thing that is really specific to bridges. When I talked about stacked interfaces: when you have a bridge, you have your interfaces connected to it, so the packets flow through your interface to the bridge. Now, the code in the bridge driver is processing those packets, and part of that code calls ebtables for filtering. So it only affects packets that go through software bridges. It's like iptables, and yes, indeed, packets can be dropped there as well. By the way, nftables... I'm not sure, does nftables support bridge tables or not? Yes? They do. Okay. Yeah. So nftables can do that as well. There's actually more. There is also arptables; there's more stuff like that which I intentionally omitted. But yeah, good point. Another question? Yeah: how many eBPF VMs are in the kernel? None. Usually. Okay, let me be more precise. A VM as in a virtual machine executing eBPF code... I mean, eBPF code is JITed. So when it is uploaded to the kernel, it's transformed into native instructions and executed as native instructions, in fact. So maybe the question should have been how many different points there are in the kernel at which BPF programs can be executed, to which I would answer: numerous. I covered the two most important ones when we are talking about packet flow, and those are XDP and TC. But there are actually more points where BPF programs are involved in packet traversal. Another one might be, for example, the flow dissector. This is the code that is responsible for parsing packets, figuring out where the IP headers, TCP headers, and so on are, and it can be extended with specific BPF programs to allow custom parsing. Another point is right before packets are received by your application: there are some clever socket filters using BPF. Or, for containers, there is the cgroup stuff, control groups; this is the mechanism that is used to limit resource usage by a group of applications. BPF programs can be attached there as well, which can control, for example, the way sockets are used by applications. Yeah. This is not an exhaustive list; these are just the examples that are relevant here. Okay. Yep. That's a good question. It's a really good question; the receive path is the interesting one. Yeah, thank you. So the question was: when there is a bridge involved, what does tcpdump see, or where is it attached? So I said tcpdump is right here, and the stacked-interface processing is actually here. So if I receive a packet on eth0, it goes to the driver, then XDP, then it's seen by tcpdump, then it goes through TC, and only then is it directed to the bridge. Yes. So when you start tcpdump on eth0, it sees all packets received on eth0 before they are handed over to bridge0. Now what happens then? Okay. So again: eth0 receives, driver, tcpdump, traffic control. Now it goes to bridge0.
It is received there, and it's actually looped back in here, so tcpdump sees it again on the bridge interface. Traffic control is run again on the bridge interface, and so on. Another question? Yes? I did not get the first sentence. Sorry, you said what is one only? So every physical... yeah. So the question is: I said that a single interface can be in a single namespace only, and this obviously affects physical interfaces as well. How come packets flow into containers or other namespaces? This is where virtual interfaces come into play. You actually have to set up a virtual network; you have to set up more interfaces. Usually it works like this: you set up a pair of virtual interfaces, called veth. Those are two virtual interfaces that are connected to each other, so whatever you send to one appears on the other and vice versa. You stick each end into a different namespace, and in the root namespace you bridge all of those endpoints together. So now you have a virtual network inside your computer. Okay, I'm not sure I got that correctly, so please, quickly. So the question was: if I move my physical interface to a different namespace, does that mean that Firefox and so on do not have an internet connection? The answer is yes, unless they are in that same namespace. You can move applications between... well, not really move; applications can be started in different namespaces, or they can actually move themselves between namespaces. So if they are in a namespace where there is no network connection, or no network interface, they obviously do not have a network connection. So you're saying that you're sure there was a case where your only network interface disappeared and you were still able to browse the internet? That would be possible only if your browser were in another network namespace. You can actually find out from procfs: /proc/<PID>/ns, I think. Thank you. Thank you.
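A minimal sketch of such a veth setup; the namespace name and the address are made up:

  $ ip netns add myns                              # the container-like namespace
  $ ip link add veth0 type veth peer name veth1    # a connected pair of virtual interfaces
  $ ip link set veth1 netns myns                   # one end goes into the namespace
  $ ip -n myns addr add 192.0.2.2/24 dev veth1     # configure the inner end
  $ ip -n myns link set veth1 up
  $ readlink /proc/$$/ns/net                       # which network namespace is this shell in?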