So, all right, the talk is about the BPF kernel infrastructure. I just gave a flash talk in Audi 1 earlier. A bit about me: I'm a performance engineer at Red Hat. I work in the networking performance team, partly on OpenShift and things like that. I've also been organizing the Linux users group in Delhi for about a year now; they have been doing very good work, so if you are in Delhi, do go to those meetups.

This talk was originally supposed to be "exploring BPF use cases", which means I was going to show everything BPF can do, but reviewers and other people helped me realize that not many people actually know what BPF is. Can you start the timer? So I converted it into a talk introducing BPF, and I wanted to go as deep as possible, so this is an in-depth introduction to BPF. I'm not going to keep it at a superficial level; I'll go as low as the system calls. I'll start with classic BPF, what exactly it is, then move on to eBPF and why it came about, then talk about something called XDP (if you attended the load balancing talk, I have some surprises for you), and then Q&A as usual.

So, how many people have heard of or used tcpdump? Can I see a raise of hands? Awesome, everyone. What tcpdump does is packet filtering, and it's used for network debugging: it traces the packets coming into the kernel, filters them, and returns the result to user space for you to see. The original tcpdump program used something called BPF, so you have been using BPF without even knowing it. BPF is the code underneath tcpdump: a virtual machine inside the kernel that does the actual packet filtering for tcpdump. It has its own custom instruction set, and it's a special-purpose VM, not a general-purpose one; you cannot run an operating system on it. It's used for one specific purpose, packet filtering and aggregation.

So why was BPF created in the first place? There were many packet filters around at that time. Van Jacobson, who, if you're aware, is one of the people who devised the TCP congestion control algorithms that probably saved the internet in the late 1980s and 90s, was one of the people behind it. The problems they saw were that all the packet filters of that time had very large overhead, meaning a lot of context switches. Say you're trying to debug a flow of a million packets: every packet has to travel all the way from kernel space up to user space, which means a lot of context switches, and that's very expensive. And since each packet has to travel from the network driver all the way up to the user space application where your packet filter lives before any filtering can happen, that is expensive too.

BPF solved both problems. First, it runs inside the kernel, so context switches are minimal; the only context switch is for user space to see what happened. Second, it does the packet filtering very early in the stack, so a lot of memory and processing is saved. So that's what it looked like: you have a driver, and you have your BPF virtual machine with its filter, inside the kernel. But it was made for one specific use case; it follows the UNIX philosophy.
The problem was this: you have your user space application, tcpdump, you write a filter expression in tcpdump, and that program gets put inside the kernel and runs there. So you're interacting with the kernel without having to reboot it, but you are limited to what tcpdump provides. Whatever syntax tcpdump gives you, you have to make do with it. The instruction set was also very limited. And there was no other API to interact with it: once the program was loaded, you couldn't do much with it beyond what tcpdump itself did.

With those problems in mind, what got created was eBPF, the extended Berkeley Packet Filter. People at PLUMgrid were working on an SDN product; they wanted the ability to interact with packets and things like that, so they enhanced the Berkeley Packet Filter with new features. The original BPF paper was back in 1992, which was the era of 32-bit registers; now everything is 64-bit, so they moved to 64-bit registers and added a few more, ten registers in total. What they wanted was a close one-to-one mapping with the x86 instruction set, so eBPF was inspired by that. And JIT support was added, so that the eBPF code running inside your kernel can be compiled directly to whatever architecture you have, say x86, ppc64, whatever. One major thing was that a bpf() system call was added, which means you now have a way to interact with the program you have loaded.

By the way, when people say eBPF or BPF, they basically mean eBPF. In newer kernels the original BPF used by tcpdump is translated into eBPF internally, so there is no "original BPF" left; whatever you see in a newer kernel is eBPF. When people need to refer to the original explicitly, they say classic BPF or cBPF.

All right. Because of these features, the extra registers and 64-bit register support, a lot more use cases came up. Networking was the original intention, then tracing came up, then seccomp, and people keep hacking things together and bringing up even more use cases. One of my favourites: if you attended the load balancing talk, Sree Agarwal mentioned that there was no well-known kernel-space program to do your load balancing. Well, Facebook is using something called XDP to deploy their load balancer. It acts as a "one-legged" load balancer. Usually a load balancer requires an entire server dedicated to that work, but in this case the load balancing program has so little overhead, doing everything inside the kernel, that the application server itself can also act as the load balancer. Facebook has already deployed it in their data centers and is seeing significant performance improvements. Cloudflare is also using it in production for DDoS mitigation, and another favourite is Cilium, which I came to know is being used for Kubernetes networking purposes.

All right, so let's take a look at how it works. I might be going a bit fast from now on. This is what the general architecture looks like; I'll explain each of the components. You have a user space program that you write in user space, in Python, C or whatever.
That program is then compiled to eBPF bytecode using LLVM; LLVM has a BPF backend. Then you use the bpf() system call I just mentioned to load the program. Next there is a verifier. Running arbitrary code inside the kernel is not considered safe, it's a very error-prone situation, so there is something called the verifier that makes sure your program is safe; I'll talk about it in a bit. Then you have the virtual machine itself, and the way it works is through hooks: the program hooks into a kernel subsystem, networking or otherwise. When the kernel code path with that hook is traversed, your BPF program is executed inside the virtual machine and does whatever you told it to do. Finally, if you want the user space side and the kernel space side to interact, you have something called eBPF maps. Maps are data structures, which I'll also talk about, that let a user space program and a kernel space program communicate. It's a two-way communication channel.

The bpf() system call looks something like this. It gives you the ability to load programs and create maps, and it takes attributes describing what the program type or the map type is, key and value sizes, and so on. Maps are basically key-value data structures: there are hash maps, arrays, per-CPU arrays, perf event arrays, things you can build histograms with, lots of types. Creating a map returns a file descriptor to the user space program, and you use that file descriptor to interact with the map and with the kernel-space program you've just loaded. The kernel also provides in-kernel helper functions, functions implemented inside the kernel that your BPF program can call; I'll come back to those. One more thing: each program can use multiple maps, and a map can be shared by multiple programs.

So, map types, as just mentioned: hash, array, perf event array, stack, and so on. You access your maps through wrapper functions, since doing it directly against the raw interface is tedious; these are just helpers that internally compile down to the bpf() system call I just showed, with attributes like the map type, key size and so on. Then you have helper functions to do operations on the map: lookup, update, delete. Remember the layout here: this is a user space program, the map itself lives in the kernel, and you're accessing that kernel-resident map from user space.

The helper functions exist to ease tasks that would be hard to do directly in BPF. You don't want to hand-write the assembly-level instructions the BPF VM understands, so a lot of helper functions have been implemented inside the kernel for you. Also, as mentioned, there are a number of program types, each for a specific use case: XDP is one, cgroup/socket programs are another. Each program type has its hook into the subsystem it wants to attach to, and whenever that code path is traversed, the BPF program runs inside the VM. The program types include socket filters, the traffic classifier (tc), perf events, and so on.
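To make that concrete, here is a minimal, hedged sketch (not something from the slides) of driving the raw bpf(2) system call from user space: creating a hash map, writing one element and reading it back. In practice you would use the wrapper functions just mentioned; this is roughly what they boil down to, and it needs root or the appropriate capability to run.

```c
/* Sketch: create and use an eBPF map via the raw bpf(2) syscall. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static long sys_bpf(int cmd, union bpf_attr *attr)
{
        /* there is no glibc wrapper for bpf(2), so invoke it directly */
        return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int main(void)
{
        union bpf_attr attr;
        __u32 key = 1;
        __u64 value = 42, out = 0;
        int map_fd;

        /* BPF_MAP_CREATE returns a file descriptor for the new map */
        memset(&attr, 0, sizeof(attr));
        attr.map_type    = BPF_MAP_TYPE_HASH;
        attr.key_size    = sizeof(key);
        attr.value_size  = sizeof(value);
        attr.max_entries = 16;
        map_fd = sys_bpf(BPF_MAP_CREATE, &attr);
        if (map_fd < 0) {
                perror("BPF_MAP_CREATE (are you root?)");
                return 1;
        }

        /* write one element through the same syscall ... */
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key    = (__u64)(unsigned long)&key;
        attr.value  = (__u64)(unsigned long)&value;
        attr.flags  = BPF_ANY;
        sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);

        /* ... and read it back */
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key    = (__u64)(unsigned long)&key;
        attr.value  = (__u64)(unsigned long)&out;
        sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr);

        printf("value for key %u: %llu\n", key, (unsigned long long)out);
        close(map_fd);
        return 0;
}
```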
For loading the program, again, you have wrapper functions: there is a library called libbpf in the kernel tree which provides these wrappers to ease the task. The struct it takes looks like this; it has the instructions and an instruction count. There is a limit on the number of instructions a program can have, which is, I think, 4096 instructions, but with root access you can go up to about one million instructions. You need to declare the program under a GPL license, and you need to specify the kernel version so that the right set of features is available to you.

So: it's running inside the kernel. Would you consider that safe? I don't think so. It's running inside a VM; still no. That's where the verifier comes in. It ensures memory safety and that the program terminates. The eBPF verifier has a couple of stages. The first stage is a DAG check: it checks the control flow and ensures there are no loops, so the program definitely terminates. The next stage checks registers and stack state: it simulates the entire execution of the program and looks for things like unreachable instructions, reads from uninitialized registers, disallowed pointer arithmetic, and out-of-bounds memory accesses. It will complain about all of these; it's a very complaining thing. So if you're writing a BPF program, you need to do your checks correctly in the program itself: check every pointer and every return value.

I hope the architecture is clear now. So how do you actually use eBPF? I've mentioned a lot of assembly and instructions and so on; it seems very complex. The first thing is to make sure you have an updated kernel version, get the latest one, and enable the kernel configs shown here. Why do you need an updated kernel? Because the features are added in stages, and people are coming up with more features day by day, since it's a very active area of development.

The original classic BPF looks something like this. You have your tcpdump program listening on some interface on port 22, and the filter looks like this C-like fragment of BPF instructions, which you load onto a socket with setsockopt() and SO_ATTACH_FILTER. It does exactly what the tcpdump program does. But it doesn't make sense for anyone to write this by hand. The same goes for raw eBPF instructions; again, you don't want to write those. So LLVM added support for BPF: there is an eBPF backend in LLVM, and you can write your program in a restricted subset of C. The C program shown here does exactly the same thing as the previous one; it's doing a map lookup.
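As a rough, hedged sketch of that restricted-C style (not the exact program from the slides), here is a socket-filter program that counts packets per IP protocol in a map, using the in-kernel helpers mentioned earlier. The map definition style and includes assume a reasonably recent clang/libbpf setup, compiled with clang -O2 -target bpf.

```c
/* Sketch: restricted-C eBPF socket filter counting packets per IP protocol. */
#include <stddef.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 256);
        __type(key, __u32);
        __type(value, __u64);
} proto_count SEC(".maps");

SEC("socket")
int count_protocols(struct __sk_buff *skb)
{
        __u8 proto;
        __u32 key;
        __u64 *value;

        /* read the IP protocol byte: Ethernet header + iphdr.protocol offset */
        if (bpf_skb_load_bytes(skb, ETH_HLEN + offsetof(struct iphdr, protocol),
                               &proto, sizeof(proto)) < 0)
                return 0;

        key = proto;
        value = bpf_map_lookup_elem(&proto_count, &key);
        if (value)
                __sync_fetch_and_add(value, 1);

        /* return 0: pass no bytes to the attached socket, we only want counters */
        return 0;
}

char _license[] SEC("license") = "GPL";
```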
And there are even more tools to ease development. There are the BCC tools: BCC is a set of libraries that lets you write your programs much more easily than writing assembly-level instructions. A BCC program looks something like this: you have some inline C code for the in-kernel part, which does the tracing, or the networking work in that case, and for processing your data you can use Python or Go or, I think, Rust; lots of languages have bindings now. Map creation is much simpler here too: you just declare what maps you need and you get them, and you do the formatting of the results afterwards.

The tool set shipped with BCC is growing day by day. I can show some examples. Say you want to look at the virtual file system: you can see what all the VFS read operations are doing. This is a tracing example; it traces the vfs_read function call. It keeps tracing, so it takes some time, and I can stop it whenever I want. So if you're running some application and want to check what vfs_read is doing at a particular moment and how long it's taking, you can run it for that specific interval and hit Ctrl-C at any time to end the program. In this case you get a histogram, which lets you find where the outliers are, meaning that if some calls are taking a lot of time you see it immediately. Here the four-to-eight-microsecond bucket is the major one, with around 65 calls, and the rest are working perfectly fine.

Another networking example I can give is tcpdrop. The thing is, many people can write these programs on their own: all of these tools are, I think, around 200 to 300 lines of code, and most of that is command-line processing, because they are command-line applications with argument parsing and so on. The actual logic is very small, and you can write replacements for strace, ltrace and tools like that. Due to time limitations I'm not going to cover all of those tools now; tcpdrop isn't even mentioned on this slide, because people are developing new tools every day. The way tcpdrop works is that it traces a kernel function called tcp_drop, which was added quite recently, in Linux 4.8. The point is that if you find a kernel function you want to trace, you can do it easily. It does some network stack tracing, and the good thing about BPF is that you can access the kernel data structures as they are. In this case you can see the socket state, which here is CLOSE. Even if you don't fully understand this output, if your application is receiving drops you can at least try to see what led to them. For example, in this case layer three is working fine, something happens at layer four, and the drop happens at this stage, because this is the function that called tcp_drop; from there you can slowly trace back to where exactly the problem occurred. It's a very good example of the tracing tools people write whenever a use case comes up.

All right. An even easier way to write BPF programs is via bpftrace. bpftrace is a high-level DSL: just like in SQL you have SELECT * FROM statements, bpftrace eases your ability to write a BPF program. You have one-liners; in this case we're tracing system call entry and counting how many have happened. You can even write it in a scripting manner so you can do much more formatting for the user. And if you're aware of SystemTap, it used to be a very widely used tool for call tracing and debugging, but it supposedly crashes kernels quite a lot; you cannot really use it in production, it's dangerous, you could say. With the 3.2 release, SystemTap now supports BPF internally, so whatever instrumentation SystemTap creates now runs inside the BPF virtual machine. It's much safer.
All right, so the next segment is XDP. I'll just mention that I was working with XDP at Red Hat for a while, and we were able to do a couple of tests, which I'll show in a moment. In brief, XDP (eXpress Data Path) is a BPF hook: a BPF program whose hook sits very early in the network stack, straight out of the driver. That means when the driver receives a packet, the XDP program kicks in. What that allows is very fast packet processing. In the load balancing talk's scenario, if you use XDP for your load balancing, as I mentioned Facebook does, you can get much more than the 80K TPS mentioned there without having to crash your systems.

Here's how it works. Your general packet processing, the layer three / layer four filtering where iptables sits, happens much higher in the stack. So say you have a packet that is supposed to be dropped: it otherwise has to travel all the way up, an sk_buff gets allocated, you use extra memory, extra processing needs to be done. If we can drop the packet right here at the XDP hook, we drop it directly and save a lot of memory and a lot of processing. That's the idea of XDP.

The good thing about XDP is that it doesn't bypass the network stack, meaning you keep whatever security the kernel stack gives you, and if your program doesn't know what to do with a packet, it just returns the packet to the kernel. It also doesn't take over the whole NIC; there is even development happening to attach XDP programs per queue. So you can still use the NIC normally. Other fast data path approaches basically take over the NIC, you cannot use it for anything else, and the traffic may not even show up in the kernel's network statistics; with XDP you don't have that problem. And it's part of the kernel tree, which means there will be more support from vendors: Netronome already supports hardware offload, and drivers like bnxt and lots of other network vendors were already supporting many features even when it was just starting to grow.

Then you have a set of actions, somewhat mimicking iptables' accept/drop capabilities. XDP_DROP: if the packet is not supposed to be on this system, you just drop it. XDP_TX: transmit it back out the same interface it came in on. XDP_REDIRECT: redirect it from one interface to another. XDP_ABORTED: if you don't know what to do or something goes wrong, abort. And XDP_PASS: pass it to the kernel stack for normal processing. You have two ways to load the program: a user space program that interacts with the kernel-space eBPF program running inside the VM, or the iproute2 tooling, which can also load the program for you. A minimal sketch of such a program follows.
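As a rough, hedged sketch of the kind of filter used in the DDoS test I'll describe next (drop TCP, pass everything else), though not the actual test program, an XDP program in restricted C could look like this, compiled with clang -O2 -target bpf:

```c
/* Sketch: XDP program that drops TCP packets in the driver and passes the rest. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_tcp(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph;

        /* the verifier insists on bounds checks before every packet access */
        if ((void *)(eth + 1) > data_end)
                return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;

        iph = (struct iphdr *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
                return XDP_PASS;

        if (iph->protocol == IPPROTO_TCP)
                return XDP_DROP;   /* dropped right at the driver, no sk_buff allocated */

        return XDP_PASS;           /* hand everything else to the normal kernel stack */
}

char _license[] SEC("license") = "GPL";
```

With iproute2 you would attach it with something along the lines of `ip link set dev eth0 xdp obj xdp_drop_tcp.o sec xdp`, where the interface name and object file name are just placeholders for this example.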
All right, so XDP in real life. On the performance team we were able to test out some scenarios, trying to find its limits. These are very initial tests, so the numbers can vary a lot later, because XDP is under active development. We tested a DDoS scenario and compared it with iptables, the existing packet filter. On this setup iptables is actually using the nftables backend, because it's a recent kernel and that's the default there. The systems we had were fitted with 40 Gb NICs, and the kernel was 4.18.

Our test setup looks something like this. We have two systems connected back to back with cables, and two flows that are bidirectional, meaning both interfaces are sending traffic. You have a TCP flow, which is 90% of the stream, and a UDP flow. On the device under test we have a packet filter: one time we load the XDP program, the other time we load the iptables rules, and do the filtering. We simply drop the TCP packets and pass the UDP packets. On the traffic generator side we measure how many UDP packets we have received. We have certain criteria for the test to pass: no more than 0.002% of the UDP packets may be dropped, and no TCP packets should be received on the interface.

So one by one we load each program, and these are the results for a single queue. Single queue means that whatever packets you receive are processed by only a single CPU. With iptables we used raw mode, which is the fastest one since it kicks in very early; with iptables raw we saw close to two million packets per second (the scale is millions of packets per second). With XDP we saw around 5.6 million, which is a great gain for the DDoS scenario we were testing. And if you look at scalability, iptables does scale up, but XDP, because of the simple idea of processing your packets very early in the network stack, stays much faster. With offload support it breaks the barriers entirely, with smartNICs coming up from Netronome and other vendors.

One interesting thing is that there is a bpfilter branch in the kernel git tree, and the work happening there is essentially to replace iptables. The newer implementation is nftables, from, I think, 2016, if I'm not wrong, but people are still using iptables, and Kubernetes has a lot of problems with it: you can end up with something like 100,000 iptables rules. With BPF it's trying to solve that problem: your iptables rules will be converted into BPF programs internally, so there will be no netfilter underneath. Now imagine support for XDP is added to that. As we've seen, the XDP results are just staggering, so iptables users could see a lot of performance gain in the future. Offload support is already there; you can check out the Netronome blog, where they did a few tests with offloading.

So BPF is exciting, and the future is exciting. Many people are using it beyond the networking and tracing use cases, such as in a research paper presented at a recent conference. If you're aware of FUSE file systems, sshfs and things like that, those are user space file systems, which means they have a lot of context switch overhead. So somebody did research with BPF, asking: why don't we do those operations in the kernel itself with BPF? They saw a lot of performance gain because of the saved context switches. And you can use BPF on Android: Android has a Linux kernel, but it's not safe to directly make changes in it, so what people do is extend the kernel features with BPF and do whatever they want to do; you're free to do that. So I'm reaching the end of my talk, way ahead of time.
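Since there is time, here is a rough sketch, not from my slides, of what the user space side of loading that XDP program could look like with libbpf, the first of the two loading methods mentioned earlier. This assumes a recent libbpf (older versions used bpf_set_link_xdp_fd instead of bpf_xdp_attach), and the object file name is made up for the example.

```c
/* Sketch: load an XDP object file and attach it to an interface with libbpf. */
#include <stdio.h>
#include <net/if.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

int main(int argc, char **argv)
{
        const char *ifname = argc > 1 ? argv[1] : "eth0";
        int ifindex = if_nametoindex(ifname);
        struct bpf_object *obj;
        struct bpf_program *prog;

        obj = bpf_object__open_file("xdp_drop_tcp.o", NULL);
        if (libbpf_get_error(obj)) {
                fprintf(stderr, "could not open BPF object file\n");
                return 1;
        }
        if (bpf_object__load(obj)) {        /* the verifier runs at this point */
                fprintf(stderr, "could not load BPF object into the kernel\n");
                return 1;
        }

        prog = bpf_object__next_program(obj, NULL);  /* first program in the object */
        if (!prog || bpf_xdp_attach(ifindex, bpf_program__fd(prog), 0, NULL) < 0) {
                fprintf(stderr, "could not attach XDP program to %s\n", ifname);
                return 1;
        }

        printf("XDP program attached to %s\n", ifname);
        return 0;
}
```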
So if you want to develop your own tools and your own BPF programs, you need an updated kernel, and then you can play around with the BCC tools. I just showed you some of them, such as the function latency one; there are a lot of tools. You can even trace your user space programs: if you're aware of kprobes, uprobes and USDT probes, all of those can be used. These are all existing pieces of kernel infrastructure, so there is minimal reinventing of the wheel; people are using what the kernel already provides to do their tracing and other things. And with XDP, you can write your own programs and play with them if you want. A couple of resources to take a look at: if you search for BPF on the internet you will get a lot of material, but this compilation has many presentations, talks, documentation and blogs, so you can take a look at it as well. All right, thanks so much. Do we have questions? If anybody has questions, raise your hand and ask. Okay, we have one; wait, let me come over.

[Audience] I see that you did benchmarking with iptables. Have you considered benchmarking with nftables, which I believe is a lot more performant given that it has data structures like sets and so on?

[Speaker] So this is kernel 4.18, right? As I mentioned, iptables there is using nftables as the backend.

[Audience] Oh, so it's using nftables, not iptables.

[Speaker] Basically you have a parameter to choose whether you want the nftables backend or the legacy iptables backend, and with the newer kernels nftables is the default. So it is that incremental implementation you're talking about; I mentioned that.

[Audience] And the second question I had was: what are "iptables" and "iptables raw" in one of the slides?

[Speaker] All right, so with iptables you have five tables: raw, filter, mangle and so on. One of those tables is raw, and the raw table kicks in very early, before connection tracking. So if you have to do the drop, it can be done very early. I wanted that comparison because XDP anyway has an edge by running so early; that's why I also tried the raw mode. The general mode, which is the filter table, is the one with 1.33 million packets per second. Cool?