So, this talk is going to take 40 minutes; I will try to finish in 30, so for the 10 minutes we have left we can discuss something, and if you want to see some demos I can add that too. I am going to talk about BPF today. It is an introduction and a deep dive, so I will try to go as deep as possible, but only as far as system calls, not internals such as machine instructions or SIMD instructions, just system calls. I will talk about some use cases and some examples.

What I do: I work in the performance division at Red Hat, which is a research division. I focus on kernel networking with XDP, which is a BPF hook, and I will talk about it. And sometimes, when I get free time, Insights, which is about proactively identifying issues in your system: events, performance issues and so on. So: BPF, why it was created, its shortcomings, then XDP, and then what the future is with BPF, the usual.

How many people have heard of tcpdump? Can I see a raise of hands? Good, excellent. tcpdump is a packet capture tool which helps in network debugging, and that is where the BPF ecosystem started. That was in 1992; the BPF virtual machine does the packet filtering for tcpdump, and that is why it was revolutionary at the time. BPF is a virtual machine running inside the kernel with its own custom instruction set. That means it cannot run an entire operating system, but it does one specific thing: it does packet filtering and aggregation and returns the result to user space.
So, at that time lots of packet filters existed, so why did Van Jacobson and Steven McCanne create BPF? Van Jacobson is synonymous with saving the Internet in the late 1980s through his congestion control algorithms. So why did they come up with BPF? At that time the other packet filters were filtering packets up the stack; that means every packet had to go all the way up before the filtering happened, which took a lot of time and could not handle the load. By running inside the kernel, BPF avoided context switches and unnecessary copying. This was the architecture back then: it would copy from the driver to the virtual machine, while the other packets went up the protocol stack as usual. But it was designed to do one particular thing, just like the UNIX philosophy, and many people wanted to do more. We want more features; people are like that, we are never satisfied. That is how eBPF came about. PLUMgrid were the people who said, okay, we want to do something in SDN and we need more features: the ability to parse the packet, modify the packet, and so on. What they introduced: the machine moved from 32-bit to 64-bit. Back then 32-bit was the thing, now it is 64-bit, someday it could be 128-bit; it will go on. The registers increased from 2 to 10, more support was added, and, most importantly, the bpf() system call was added. The bpf() system call gives you the ability to make dynamic changes; it means user space can now communicate with the program, not only at load time. So now, whenever you say BPF, people usually mean eBPF, not the original BPF; if they need to refer to the original they will say "classic BPF". If you say BPF today, that means eBPF. And because of the new features, lots of new use cases appeared: tracing, security; networking was already there but more networking stuff came along; people are doing IR decoding, people are doing something with Android; lots of stuff is coming up.

So initially BPF was a technology, a piece of software; now it is a platform, and that is why the title says "infrastructure". You can use it to do almost anything; it is eating the software world, it is taking up everything, so someday you are going to encounter eBPF. The motive of this talk is to make you aware of what BPF is and where it is going to make an impact in the ecosystem we have today.

Let's see how it works. You write a user space program, it is compiled to eBPF bytecode, and you load it into the kernel using the bpf() system call. It goes through the verifier, is then passed on to the virtual machine, and then you have hooks; I will talk about every component I am mentioning, one by one, so don't worry. Once it is loaded into the kernel, you have a way of communicating through maps, and from user space you can access the data that you got from the kernel.

The bpf() system call looks something like this: you have your command, which says whether you want to create a map, load a program, or do something else; you have an attribute which carries the parameters for that command. If you create a map, there is one attribute layout for that; if you load a program, there is a different attribute layout; it is a union, basically. And then you have a size. Maps are key-value data structures of different types: it can be a hash, an array, a perf event array. How do you interact with one? You get a file descriptor back for the map that you create in the kernel, so that you can use the file descriptor just like if you have done file handling in C; so you get a
file descriptor, and then you can access the map through that file descriptor, much like fd = open(...) and so on in file handling. And then there are helper functions that help you interact with the map; I will explain why. One program can have many maps, and one map can be accessed by many eBPF programs. The map types: I just mentioned hash, array, and perf event array; there are also per-CPU hashes, dev maps, lots of them. And there is one interesting map type, the program array. It is not possible to do loops in BPF programs because of the verifier, which I will talk about as well, but with tail calls you can chain programs back to back. There is a limit, though: you can only chain up to 32 programs. That lets you add more features to a program.

Internally it looks something like this. There are tools, and there are proper libraries that ease the way you write a BPF program, but this is how it looks internally: everything gets converted into the bpf() system call. For BPF_MAP_CREATE I am just showing you how the attribute maps onto the call: map type, key size, value size, number of entries, and so on. Then you access the map with operations such as lookup and update. Then you have helper functions. Helper functions exist in the kernel to ease out the things you cannot do easily from within a BPF program; for example, there are bpf_map_lookup_elem() and bpf_map_update_elem() for maps, and there are lots of other helper functions for tasks you cannot do normally.

Then you have program types. Why do we have program types? Because BPF has hooks into a lot of subsystems: security, networking, tracing with kprobes, tracepoints, uprobes; lots of hooks are there, and you can even create your own hook and get it merged into the kernel. Whenever an event fires and you have a hook on that kernel event, the BPF program is executed; it does its processing and returns back to where the hook was invoked, just like a function call: you call the function, it runs, it finishes, and control comes back. If you are aware of kprobes, it works similarly. So you have program types: socket filter, kprobe, XDP, cgroup, lots of them, and more are coming; people keep adding new ones for their own use cases. With BPF_PROG_LOAD, again it finally converts to the system call, and you get a file descriptor back so you can manage the program, even though it is running inside the kernel.

Now, many people will hesitate to run something inside the kernel. Is it running inside a virtual machine, and does that make it safe? I don't think that alone does. So, to make sure your program is safe and executes in a safe manner, you have a verifier, which does memory checks and checks for program termination; that means the program must execute in finite time so that the kernel doesn't loop, so you have checks for cycles. Then, in step two, you have a simulation where the state changes are checked to make sure that, for example, there are no uninitialized registers, things like that. There is also work going on for loop support, called bounded loops, but it is still in progress; the goal is for BPF to support loops, which is a basic programming feature.

So this is how the architecture looks, as I just tried to explain. You have the bpf() system call, which loads the program into the verifier, and then, according to the program type, you have certain hooks. If it is a tracing hook, it will attach to the kernel event that you are trying to access, and then it
will run inside the virtual machine, and then you have data about what is happening inside the kernel in the BPF maps. You can read that data, or you can do more processing and make use of it. Say you are running a load balancer: you can see from the data that something is wrong and then change a configuration or kernel parameter, one value for high load and another for low load. I am just giving an example; you can do a lot of stuff.

How to use it: you need an updated kernel, as new a kernel as possible, because the features are added step by step and more keep coming in.

The original BPF looked something like this: a filter program as a fragment of assembler-like instructions, which you would load with setsockopt(). This is not very writable or readable; it is what a tcpdump filter expression gets compiled into, and you would not want to write that by hand; that is why tcpdump is there. Raw eBPF bytecode is still messy, so people added LLVM backend support, which means you can write eBPF in a limited, restricted C: whatever this C code here is, it is equivalent to that bytecode, so it simplifies things a lot. And then you have more wrappers, more ways to write BPF programs, such as BCC: you can write in Python, Lua, Go; people are writing lots of tools in Go. You have inline C code, but you do the processing in Python.

These are some of the BCC tools that people have written for tracing; this slide is specific to tracing, and it is from 2017. You can do a lot of stuff safely, and these tools are much, much less invasive than the existing tools used for tracing. Let's say you want to trace host lookups, to find the time a DNS query takes: with gethostlatency you can see that latency, and you can configure the tool any way you want. It is under 200 lines of Python
line scripts. This one hooks the syscall-enter tracepoint, so it counts system calls per program; you can trace that, or write it as a script so you can format the output yourself; what this program does, for instance, is find the top 10 system calls being performed on the system. If you are aware of SystemTap, as most of you might be: recently, version 3.2 converts to BPF internally. With SystemTap you had to load a kernel module, but in this case, since it runs inside the virtual machine, you don't have to do that anymore.

All right, moving on to the next segment, which is networking. I focus on XDP at Red Hat. XDP is a BPF hook, meaning a BPF program that runs very early in the network stack: as soon as the packet reaches the driver, it kicks in and can start filtering packets. The stack looks something like this: you have your network hardware and the driver, and immediately XDP is invoked if an XDP program is attached; only after that is the skb built, and somewhere above that, at layer 3 and layer 4, you have the IP and TCP layers. The good thing about XDP is that it doesn't bypass the network stack but aims to work with it. That means you keep all the features and security features you have in the kernel, the existing headers and structures, unlike other high-performance data paths such as DPDK, which have to reimplement the protocol stack and all the related stuff.

You have a couple of actions: DROP, PASS, TX and REDIRECT. This mimics something like iptables; if you have heard of iptables, and I am sure you have, iptables will accept or drop a packet and has a lot of features like that, so XDP tries to mimic part of what it does. We were doing some scenario testing with iptables, comparing it with XDP, using the newest kernel. Some of the system configuration: we have a 40-gig dual-port card, and we are running kernel 4.18. The test setup looks like this: two systems connected back to back with two wires, and we are sending traffic streams both ways; the traffic generator sends traffic from both NIC interfaces at very high speed. The first time we load the iptables rules, and the second time we load the XDP program, and we run the test with traffic flowing both ways. We have a packet-drop threshold that must be maintained: if drops are above 0.002 percent, the test fails, so we have to ensure packet drops don't happen. It is a DDoS-style scenario: we are simply dropping the TCP streams (this is a very initial test), passing the UDP packets, and measuring on the traffic generator how many packets are received back; it is a forwarding test. The results look like this: with iptables we were able to reach close to 1.3 million packets per second, for a single CPU and a single queue, and with XDP we were able to reach 5.59 million packets per second. Then, for scaling, we measured how it performs across queues. These tests were very interesting; the iptables raw mode was included because raw mode kicks in very early, so, to give an edge to iptables, we wanted to see how it fares against a comparable technology, because XDP also kicks in very early; we had to make sure it was a fair comparison. There is a bpfilter branch in the upstream kernel that is trying to convert iptables rules to BPF internally, with a JIT, and network vendors such as Mellanox support it; you get XDP hooks and TC hooks, and you will see a big performance improvement over iptables. And if you are running Kubernetes applications, you can hope to see a lot of
performance boost there. In the routing case, there was a kernel helper added for FIB lookup, the forwarding information base, which does the routing: the kernel reads the routing table and decides where to forward the packet. With XDP, people did a prototype: with the normal path they were able to forward about 1.9 million packets per second, but with XDP they got around 7 million packets per second. That is a big boost, and it is just 345 lines of changes. Then Facebook has been working on, and I think has already released, an open source project called Katran, a solution for layer 3 / layer 4 load balancing. XDP can also be used as a one-legged load balancer, which means you don't have to dedicate an entire system to the load balancer, which saves you money: you can run it on the same system that runs the application, because it runs so early in the stack that it is less invasive.

Regarding monitoring: I have talked about networking because that is my focus, but monitoring is a very interesting topic, and I think many people here are interested in it. I discussed the eBPF exporter with Gotham yesterday; there is a Prometheus exporter where you give your eBPF code to the exporter, it collects that information and exposes it as metrics, so you can analyze it in Prometheus. And one interesting thing I got to know from Swapnil, whose talk was before mine: this program is very simple; basically what it does is record the IP address of everything making a new outbound connection. If you attended the skimming talk by Arjun, he mentioned that you might not know what calls are being made outside your host; when a JavaScript file is poisoned or modified, it will make calls outside your trusted IPs. Using this, you can try to find that, okay, maybe a JavaScript dependency is sending data somewhere you are not aware of. So this is a debugging tool, but you can also use it as a monitoring tool, and it gives the AS info as well, the autonomous system the connection is coming from.

The BPF ecosystem is growing and it is very active: six months ago it was one thing, right now it is another, and six months from now it will be totally different, so my talk will be obsolete. A couple of interesting things people are doing: I just mentioned networking and tracing, but as I said, BPF is eating up the software world; it is coming up everywhere. If you are aware of the FUSE file system, it is used in Gluster as well; somebody did a bit of research work on it. FUSE is a user space file system, which means it has to do a lot of context switches; with BPF they avoided a lot of the context switches and gained a lot of performance. With Android, you can remotely monitor Android devices: usually the Android kernel is not something you can modify easily, you have to know the internals, but with BPF you can extend that kernel and add features that are not there, without having to get changes merged into the kernel. And there is another interesting thing, IR (infrared) decoding, that somebody added on top of BPF. The diversity of things people are able to do with BPF is very, very large.

I am reaching the end of the talk, almost at 30 minutes. Get an updated kernel so that you can play around with BPF, and check out the BCC tools. There is an XDP tutorial as well, so if you have any use case specific to networking, such as load balancing or routing, you can play around with that and
see how it works with your system. Maybe you want to change masquerading or something, but you don't want to use iptables, which would add a lot of overhead; you can do that. Tracing, networking, and security as well, with seccomp, if you want to make sure your system is secure. These are some of the resources; the first one is a compilation, so you can get most of the details there. Thanks so much. Any questions?

[Question] How does BPF help in reducing context switches in FUSE?

[Answer] Check out the link first, but what they are trying to do is this: the file system logic runs in user space, so the communication with the block devices involves a lot of round trips. You have to do a lot of aggregation, so that instead of doing multiple context switches you can aggregate and then send the data, or cache it; something like that.

[Question] Would you ever end up modifying data being read as part of a system call using this?

[Answer] That's a good question, but to be frank, my expertise is on the networking side. With networking, it is possible that you would be redirecting your packets to some other place; that is possible.