I'll keep the introduction short, without much ado. We're talking about high-speed packet filtering, from Cloudflare's London office. Welcome to DEF CON, and it's my pleasure to introduce our speaker.

Okay. Good morning, thanks everyone for joining me today. Today we'll talk about the past, present and future of high-speed packet filtering on Linux. So let's get started.

About me: I'm a systems engineer at Cloudflare, working in the DDoS mitigation team in the London office. For those of you who don't know what Cloudflare is: Cloudflare is a security and performance company which offers a reverse proxy / CDN service. When a site is on Cloudflare, all of its traffic is routed through our edge network first and then proxied back to the origin servers. This means we can provide different services such as caching, content optimization and, most importantly for the sake of this talk, DDoS mitigation.

We have a fairly large network, so we see a lot of traffic every day, and also a lot of malicious traffic. In fact, on a daily basis we have to deal with hundreds of different DDoS attacks. To give you some numbers, on a normal day we see attacks ranging from 50 to 100 million packets per second and from 50 to 250 gigabits per second, but we have also seen much bigger attacks. This graph, for example, shows one of the biggest TCP SYN floods we recorded a few months ago. The green line represents the volume of the attack traffic, while the small yellow line at the bottom represents the legitimate traffic. The job of my team is basically to drop all the green traffic, all the malicious traffic, before it hits our servers, without affecting in any way the small portion of yellow, legitimate traffic.

Okay, so let's start with iptables. In the beginning we were relying simply on iptables to drop traffic, and iptables is great for many reasons: it's a well-known tool, and it's relatively easy to interface with it from different programming languages.
It has this nice concept of tables, which hook into different parts of the network stack, and chains. It's also well integrated with the Linux kernel, and it has support for BPF matches, which means it's possible to use BPF bytecode to specify how a rule will be matched, and this gives iptables reasonable flexibility.

Speaking of BPF, in the last three years we've been developing a set of utilities called bpftools, which allow us to generate BPF bytecode on the fly to match a specific class of packets. In this example I'm basically generating BPF bytecode to match a specific class of TCP SYN packets. Without going too much into the details, the weird string that you can see passed to the bpfgen command is called a p0f signature, and it's basically a short description of all the interesting fields of a TCP SYN packet.

The way we used to use iptables is the following: we use the usual iptables syntax to express most of the mitigation logic, and we resort to BPF to express those bits that cannot be expressed by just using iptables. This works great, but there is a problem: with iptables we can't handle big packet floods. Last time we checked, we could do two or three million packets per second on a single box, leaving no CPU at all for the user space applications. This is a problem for us because on our network each server, which serves HTTP and DNS traffic, also participates in DDoS mitigation, so we can't afford to dedicate a whole box just to mitigating DDoS attacks. There are a few Linux alternatives, but we didn't really consider any of them, and the reason is that we are not just trying to squeeze out a few more million packets per second of DDoS mitigation capacity: we'd like a solution that uses as little CPU as possible to filter packets at line rate.

Before discussing our current setup, I'd like to give you a quick introduction to the path of a network packet in the Linux kernel, to make it clear why it's so expensive to filter packets by just using iptables. First let me define a couple of important data types that I will refer to later during the talk. When I speak about packet buffers, the ones at the bottom, I'm referring to memory pages which contain the actual network packet. When I speak about sk_buffs, I'm speaking about a big data structure that contains a lot of metadata related to a single packet buffer: basically a structure that keeps the context about a specific packet while it's being processed by the network stack. And when I speak about RX rings, the ones in the NIC diagram, I'm speaking about the data structures used by the network card to keep track of all the packet buffers that are currently in use.
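To make the relationship between these three structures a bit more concrete, here is a deliberately simplified sketch in C. These are not the real kernel definitions (the real struct sk_buff alone has dozens of fields); the names and layouts below are my own illustration of how an RX ring of descriptors points at packet buffer pages, and how an sk_buff wraps one filled buffer with metadata.

```c
#include <stdint.h>

/* A packet buffer: a chunk of (DMA-mapped) memory the NIC writes a
 * received frame into. In the kernel this is typically part of a page. */
struct packet_buffer {
    uint8_t data[2048];
};

/* One RX ring descriptor: tells the NIC where a free packet buffer is
 * and, once a frame has landed there, how long it is. */
struct rx_descriptor {
    uint64_t buffer_addr;   /* DMA address of a packet_buffer */
    uint16_t length;        /* filled in by the NIC on receive */
    uint16_t flags;         /* e.g. a "descriptor done" bit */
};

/* The RX ring itself: a circular array of descriptors shared between
 * the NIC and the driver, plus bookkeeping indices. */
struct rx_ring {
    struct rx_descriptor desc[512];
    uint16_t next_to_clean;  /* where the driver will look for new packets */
    uint16_t next_to_use;    /* where the driver refills fresh buffers */
};

/* A (very) simplified sk_buff: per-packet metadata the stack carries
 * around while the packet travels through the kernel. */
struct sk_buff_like {
    struct packet_buffer *buf;  /* the actual packet data */
    uint32_t len;               /* bytes of data in the buffer */
    uint16_t protocol;          /* e.g. ETH_P_IP */
    void *dev;                  /* the device it arrived on */
    /* ...the real struct sk_buff has many more fields... */
};
```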
So, how does Linux receive packets? Linux uses a mechanism based on polling: periodically Linux invokes net_rx_action, goes through the list of network cards and, for each network card, checks if there are new packets. The important point of this slide is that the network driver alone has to do a lot of expensive memory operations to deal with each packet: it has to handle the DMA mapping of the packet buffer, allocate an sk_buff, allocate a new page for a new packet that will replace the old one, actually process the packet, and in the end free all these data structures. So there is a lot of memory pressure just from the point of view of the driver.

I want to quickly go through this call trace of the network stack receiving a packet, just to give you a sense of how many operations the network stack performs when it receives a packet, and therefore why it's so expensive to drop packets. As I said, net_rx_action is called periodically; it calls the driver's poll function, sk_buffs have to be allocated, and there is some generic receive offload (GRO) processing; this is repeated for each packet. Then the driver has to allocate new pages for the packets that will arrive, and there is IP layer processing, the iptables raw chain, conntrack, more conntrack, the routing decision, and finally we encounter the iptables filter chain, which is fairly late in this whole list of calls. Finally there is layer 4 protocol handling, and I'm omitting other things that we don't care about here.

So what's the point of this? The point is that we should not say that iptables is slow: we should say that iptables is just executed too late in the network stack. We have to do all these operations for every packet that we're going to drop, and that's why it's so expensive to drop packets this way.

So let's move on to our current solution, which is based on user space offload. When I say user space offload, I mean offloading to user space the task of dropping packets. User space offload is based on a technique called kernel bypass. Kernel bypass means taking some of the network card's receive and transmit rings, mapping them into user space, and allowing a user space program to deal with them directly. Basically the network card is partially detached from the Linux network stack, and the user space program takes control of those rings, so it can read packets from them and write packets to them. One of the reasons this technique is used is to implement user space network stacks, because a user space program can read and write packets directly from the NIC without having to deal with the kernel. But this technique is also useful to filter packets, because you can map a receive ring into user space and a user space program can start reading all the packets from there, without any interference from the kernel, and then filter what are really just buffers of packets.

The way this works is by selectively steering the attack traffic to a specific receive queue. After that, a user space program is able to read the packets from that queue: if a packet is a good one, it is re-injected into the network stack, and if it's a malicious one, there is nothing to do, really, because of the way these frameworks work. Instead of having to allocate and deallocate packet buffers like the Linux kernel does, these frameworks usually use circular buffers, so there is no real work to do when we drop a packet. And since, in the case of big floods, we are dropping 90 or even 99 percent of the traffic, this is a big saving in terms of performance.
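The talk doesn't name a specific kernel-bypass framework at this point, so the sketch below uses made-up placeholder functions (bypass_open, bypass_next_packet, bypass_reinject) standing in for the framework-specific equivalents; netmap, DPDK and similar frameworks all expose loops of roughly this shape. It is only meant to show the busy-polling filter loop just described: read raw buffers from the mapped ring, re-inject the good ones, and simply move on for the bad ones.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel-bypass API: these functions do not exist in any
 * real library; they stand in for whatever framework is in use. */
struct bypass_queue;
struct bypass_queue *bypass_open(const char *ifname, int rx_queue);
uint8_t *bypass_next_packet(struct bypass_queue *q, size_t *len); /* busy-polls */
void bypass_reinject(struct bypass_queue *q, const uint8_t *pkt, size_t len);

/* The mitigation rules, e.g. compiled from p0f signatures. Returns true
 * if the packet matches a "drop" rule. */
bool matches_drop_rule(const uint8_t *pkt, size_t len);

void filter_loop(const char *ifname, int rx_queue)
{
    struct bypass_queue *q = bypass_open(ifname, rx_queue);

    for (;;) {
        size_t len;
        /* Busy-poll the mapped RX ring: this is why one or more cores
         * have to be dedicated to the filter. */
        uint8_t *pkt = bypass_next_packet(q, &len);

        if (matches_drop_rule(pkt, len)) {
            /* Malicious: do nothing. The slot in the circular ring is
             * simply reused for the next incoming packet. */
            continue;
        }

        /* Legitimate: hand the packet back to the kernel network stack.
         * This re-injection is the expensive part for good traffic. */
        bypass_reinject(q, pkt, len);
    }
}
```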
We use a technique called partial kernel bypass, because we keep most of the network card's receive rings attached to the kernel network stack and we bypass only one or a couple of receive rings. We use what is called a flow steering rule: basically we try to steer all the potentially bad traffic to a specific RX queue, and then we filter it there. When I say potentially bad traffic, I mean for example traffic that is targeting a specific destination IP. This way we can let the kernel work on the normal traffic as usual, and we can filter the bad traffic in user space.

There are a bunch of different frameworks to do this, and they all do more or less the same thing. From the user space point of view, a user space filter works as follows: the program keeps asking the network card for new packets, and as soon as there is a new packet it goes through a list of rules. If any of them says drop, the packet is dropped, which means we do nothing; if the packet doesn't match any rule, it means it's a good one, and so it is re-injected into the network stack.

This solution works great, because we can filter six to eight million packets per second on a single core, compared to two or three million packets per second on a whole box, so it's a great performance gain. But unfortunately there are some limitations. Legitimate traffic has to be re-injected, and this can sometimes be expensive, depending on the framework we are using. One or more cores have to be completely reserved for this, because the way we do it is by busy polling the NIC, and we do that because we want to keep the latency as low as possible. Also, you have to pay for a lot of kernel space / user space context switches every time a packet moves from the kernel to your filtering user space program.

And so, the future: XDP, or eXpress Data Path. XDP is a new technology recently introduced in the Linux kernel, and it's a concrete alternative to iptables or user space offload. The idea is to filter network packets as soon as they are received by the network card, using an eBPF program which takes a packet as input and produces as output what is called an XDP action. This can be XDP_PASS, meaning the packet should proceed to the network stack, or XDP_DROP, meaning the packet should be dropped; there are other actions and other cool things you can do with XDP, but we don't care about them today.
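For reference, the full set of verdicts an XDP program can return is a small enum in the kernel's UAPI headers; this is roughly what it looks like in current kernels (XDP_REDIRECT is a later addition), and this talk only uses XDP_PASS and XDP_DROP:

```c
/* Paraphrased from include/uapi/linux/bpf.h: the verdict an XDP program
 * returns for each packet it sees. */
enum xdp_action {
        XDP_ABORTED = 0,  /* program error; treated like a drop, can be traced */
        XDP_DROP,         /* drop the packet; its buffer is recycled in place */
        XDP_PASS,         /* let the packet continue into the normal stack */
        XDP_TX,           /* transmit the (possibly rewritten) packet back out
                           * of the same interface it arrived on */
        XDP_REDIRECT,     /* send the packet to another interface, CPU or
                           * AF_XDP socket (added in a later kernel release) */
};
```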
If you remember the call trace from a few slides ago, this is where XDP runs: right at the beginning, as soon as you receive a packet inside the network driver. So before even allocating sk_buffs, you run your eBPF program and you decide whether you want to let the packet flow through the network stack or you want to just drop it. If we take a look at one of the drivers that implement XDP, we can see how the XDP_DROP action is actually similar to what we were already doing with user space offload: there is basically nothing to do. If you want to drop the packet, you just do nothing, you leave the packet buffer where it is in the network card ring, and you go to the next packet. So there is no memory allocation or deallocation cost at all.

XDP therefore has the same advantages as user space offload: there is no kernel processing involved, and there is no memory allocation and deallocation cost, nor the DMA map and unmap cost, which is a very expensive operation. But it also has a couple of additional advantages: you can use eBPF to program your filtering logic, and there is no need to re-inject packets, because if a packet is not going to be dropped it will just go through the network stack the usual way.

Let me spend a couple of words on eBPF. eBPF is a relatively new technology, I think it's about two years old, which is gaining more and more traction in multiple kernel subsystems, for example in the tracing subsystem. It's an extension of classic BPF, and it's much closer to a real CPU architecture; in fact it's also JIT-compiled on many different architectures. But the two great things about eBPF are that it provides safe memory access guarantees and time-bounded execution. When you load an eBPF program into your kernel, you know that it will be as safe as running a program in user space, because the program cannot access random regions of kernel memory, and the program is guaranteed to terminate, because no backward jumps are allowed. In fact, every time you load an eBPF program, the kernel runs a verifier, and if your program does not meet these guarantees it is simply rejected. eBPF also provides maps shared with user space, so you can have some kind of state shared between your XDP program and your user space program. Another great thing is that there is an LLVM backend, so you can write your program in C, compile it to eBPF, and then load it into the kernel, which is super handy.

Okay, so this is not an XDP tutorial, but I just wanted to show you a quick sample of an XDP program. This is a super simple example. An XDP program is a function that is called on each packet; it receives as input a context, which is just a couple of pointers to the beginning and the end of the packet, and then we can access the packet using the Linux kernel data structures that are already there. An important thing to notice is that we are explicitly making sure that the program is not reading past the packet buffer boundaries; otherwise it would not even be possible to load the program, because the kernel verifier would complain. Then we can go on and check that the packet is an IP packet, access the IP header, make sure again that we are not reading past the buffer, and return an action based on whatever filtering logic we use.

The second thing is maps. I mentioned maps, and they are super simple to use: you define the size of the key, the size of the values and the size of the map, and then, using BPF helpers, you can access a specific key and set its value. So it's just like a normal hash table, and it's shared with user space.

And the great thing is that we can of course auto-generate XDP programs, because it's just C, so it's easy to write a script that generates the C code for you automatically. In this case I'm generating a program for the same p0f signature I showed you before with bpftools, but instead of outputting cryptic BPF bytecode we can output actual C code, then compile it to eBPF and just load it. I will not go through the whole code, but you can see it accesses the different headers, the IP header and the TCP fields that we are interested in.
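The actual code from the slides isn't reproduced in this transcript; the sketch below is my own minimal reconstruction of the kind of program described above, written for clang's BPF target with a reasonably recent libbpf. The map name, the per-source drop counter and the hard-coded destination address (a documentation IP) are illustrative choices, not the exact code shown in the talk.

```c
/* xdp_drop_filter.c - build with: clang -O2 -g -target bpf -c xdp_drop_filter.c */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Map shared with user space: counts dropped packets per source address. */
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);    /* IPv4 source address */
    __type(value, __u64);  /* number of packets dropped */
} drop_counters SEC(".maps");

SEC("xdp")
int xdp_filter(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Bounds check: the verifier rejects the program unless every access
     * is proven to stay inside [data, data_end). */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* Illustrative "mitigation rule": TCP packets to 198.51.100.1 are bad. */
    if (ip->protocol == IPPROTO_TCP &&
        ip->daddr == bpf_htonl(0xC6336401)) {
        __u32 key = ip->saddr;
        __u64 one = 1, *count;

        count = bpf_map_lookup_elem(&drop_counters, &key);
        if (count)
            __sync_fetch_and_add(count, 1);
        else
            bpf_map_update_elem(&drop_counters, &key, &one, BPF_ANY);

        return XDP_DROP;  /* buffer stays in the RX ring, no allocations */
    }

    return XDP_PASS;      /* everything else goes up the normal stack */
}

char _license[] SEC("license") = "GPL";
```

With a recent iproute2 such an object file can typically be attached with something like `ip link set dev eth0 xdp obj xdp_drop_filter.o`, and the drop_counters map can then be read from a user space program.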
So if you're curious and you want to try it, the first thing to keep in mind is that XDP is a new technology, as I said, so it requires support in drivers. Initially only the Mellanox cards supported it, and now at each kernel release there are new drivers that gain XDP support. But starting from Linux 4.12 there is a new feature called generic XDP, so even if you don't have one of these nice network cards, you can just run XDP on your laptop, play with it and see how it works. There are some samples in the kernel tree: in samples/bpf you will find some XDP programs, and there are also a couple of libraries that will help you load and manage XDP programs. And that's all from me, so: questions?

How many filters can you load with XDP?

XDP uses eBPF, and eBPF has a hard limit of 4,096 instructions per program. So as long as you can stay within that program size, you can load it; you can chain multiple XDP functions, you can do whatever you want. You can even change that limit if you know what you're doing, but the idea is that you should try to keep the eBPF program short, because it's running in kernel space and you don't want to hang the whole kernel while processing packets.

Do you make a distinction between v4 packets and v6 packets? Are those two separate programs, or the same program?

This is up to you, but you can for example start processing the Ethernet header, check which version of IP you are dealing with, and then call another function, because it's just C, so you can chain multiple functions as you would when writing ordinary C.

Did you explore the mangle and raw tables in iptables before going this whole route and doing the direct filtering?

The mangle and raw chains, those happen before... no, those two iptables chains happen after XDP.

Okay, no more questions, so: thank you.