Okay, our next talk is going to be about BPF as a revolutionary technology for the container landscape, by Daniel Borkmann. He's one of the BPF kernel maintainers.

Right, yeah, thanks a lot for the introduction, and it's really great to see so many people here; the room is fully packed. So first off I want to start with the container landscape. I guess it's obvious to everyone that containers have existed for a long time, but how do production deployments actually look? I found a nice usage report from the Sysdig folks where they looked into various aspects, and I picked out three points from there that I found very interesting as a trend. One is the lifetime of a container: we can see here that there's a continuously decreasing lifetime. Many of the containers, as you can see on the chart, run for 10 seconds or less, and that basically doubled over the last year, and that trend seems to be increasing. The other thing is how many containers are running on one single node; I hope you can see the numbers well here. That number has also doubled since last year, and the median now seems to be forty. And the last one, I guess it's pretty obvious, is what the main orchestrator is if you deploy containers, and that's Kubernetes: if you include OpenShift and Rancher here as well, it's 89 percent.
So that's quite a vast majority. The common denominator, the common base platform for all of this, is of course the Linux kernel, and it has to provide all the building blocks that we know. For isolation in terms of namespaces, you have various namespaces: network namespaces, and recently the time namespace got merged into the kernel. For resource management, that's done through cgroups. It has to provide network connectivity in the case of network namespaces, either through veth devices, IPVLAN devices, or various others, and security policies that you can implement, which can be done in terms of Linux Security Modules or firewall policies, and so on and so forth. And all of this also has to withstand the increasing scalability needs and the higher and higher churn frequencies of containers, while also coping with subsystems that have existed in the kernel for a very long time, 10 years or more, and that were designed back then. And of course there's the kernel's paradigm of never breaking user space, so all of this has to keep working while you optimize all of it. Just to pick out two examples in networking, there's TC and iptables/netfilter.
They were both designed long ago, and they're both generally extensible, but at the same time their framework is inflexible for today's needs, because in the case of TC and iptables the whole processing pipeline is part of the API contract with user space. For example, TC has classifiers and actions that you can load, and you really have to specify how a packet is processed: if it doesn't hit this classifier, then you create another classifier instance, and if it matches, then you have to go through this action. Similarly with iptables, in the worst case you end up traversing a linear list of rules. So if it gets really complex, it slows down your fast path, because it's sitting right in the middle of it. And system software today basically has to support a wide range of kernels, because people in production are running old kernels or, ideally, the latest ones. We would love for everybody to run the latest kernel, but that's not always the case. So you basically end up baking policy deeply into your code base, and it takes significant effort if you want to rewrite it. Just a random pick: libnetwork.
It basically encodes the arguments for later shelling out to iptables, and that's not really great. I found this xkcd that says to your future self: if you want to fix it, you have to rewrite it. It's kind of creepy, but yeah. Not only libnetwork, but Kubernetes also relies a lot on iptables, specifically for its service implementation through kube-proxy. And there are also issues in terms of scalability. One of them is low and unpredictable packet latency: given the way iptables is programmed and traversed, if you have many different services, you either hit the first rule, which is really the best case, or you have to traverse all the different rules and end up at the last one, and that's not really a nice outcome. There are slow update times, because you have to replace the entire blob of rules, and this can take a really long time if the kernel has to pause, make sense of it, and install the rules into the data path. There are also reliability issues (I'll get to those in a bit) and inflexibility, because of the contract with user space that I mentioned. In terms of performance, even when you do basic measurements from host to host, with just a very few rules in the kernel, if you run perf top to see what the heaviest hitters are in terms of consumed CPU cycles, it's often iptables. And there are other issues too. Probably many of you who run Kubernetes in production have hit this bug where you got DNS delays of five seconds or more because of a connection tracking race. My colleague Martynas actually submitted a patch in August 2018, but the patch only got merged around half a year later. He even did some research: the first occurrence of the bug that people were hitting was actually in 2010.
It's fixed now, which is really good, but until a fix goes all the way downstream to users in production, it actually takes a lot of time. There are also some other issues along the way. For example, I copied here the documentation from Kubernetes for kubeadm. iptables is now being replaced with nftables, but there are issues with the don't-break-user-space part, where they recommend that you should basically keep running the old iptables binary and not upgrade, because rules get installed twice, which breaks kube-proxy.

In terms of debuggability: if you only have a few rules, that's fine, I think that's fairly easy to follow. You can also see the counters if you run iptables-save -c, so it's easy to debug. But if you have a lot of rules, for example here on the left side we deployed just a hundred services, all with the same backend, it already gets complex to follow where your packet is dropped. And the other thing is that kube-proxy periodically resets the iptables counters, so good luck with the debugging.

To show you the basic packet flow: you have an SKB at the beginning, it hits the TC ingress path, then later on all the yellow boxes are netfilter-related, you do a FIB lookup, and if you want to push the packet back out of the kernel, you hit the TC egress path and send it to the device. And just to show you an example of how kube-proxy installs the rules with iptables, so that you have a clearer picture: if you for example create an nginx service, which has a virtual ClusterIP of, say, 3.3.3.3, and it has two endpoints, then this is how it looks in iptables; that's the way kube-proxy installs it.
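kube-proxy spreads connections across a service's backends with chained random-probability iptables rules. The following is a rough model of that selection logic in plain C, a sketch of my own rather than kube-proxy's actual code: with N backends, rule i is given match probability 1/(N-i), so each backend ends up equally likely.

```c
#include <assert.h>

/* Toy model of kube-proxy's backend selection. Real iptables draws
 * a fresh random number per statistic-mode rule; here we model the
 * whole chain with a single uniform sample u in [0,1) by
 * renormalizing the remainder after each miss. */
static int select_backend(int n, double u)
{
    for (int i = 0; i < n - 1; i++) {
        double p = 1.0 / (n - i);   /* probability for rule i */
        if (u < p)
            return i;               /* rule matched: DNAT to backend i */
        u = (u - p) / (1.0 - p);    /* miss: renormalize, fall through */
    }
    return n - 1;                   /* last rule matches unconditionally */
}
```

For the two-endpoint nginx service above, the first rule fires with probability one half and the second catches the rest, mirroring the 0.5-probability random selector kube-proxy installs.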
It basically first creates a rule in the NAT table for prerouting, where you redirect all the new traffic to the KUBE-SERVICES chain. There you check for the virtual IP, and if that matches, you go to the KUBE-SVC chain for nginx, where you have a random selector with a probability of one half, because you have two backends, and then you select either the first one or the second, and the packet goes on. And if you have many, many rules, you basically have to match on all of the service IPs first before you end up at the right one.

But you can actually optimize this path by only going from TC ingress to TC egress, with the help of eBPF. I mentioned in the beginning that TC also has issues with exposing the processing pipeline, but the module that we wrote for eBPF basically bypasses this, because you can do all of it in eBPF and directly return a verdict, without having to process many, many TC rules. So it's actually efficient, and I would say it's probably the only efficient module in the TC software path. How does it work? You have a piece of code, it's a C-like syntax, and you can compile it with LLVM; LLVM has a BPF backend, and that generates an object file.
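The cost difference between matching on all the service IPs in sequence and doing a single BPF map lookup can be illustrated in plain C. These are toy data structures of my own, not the real iptables or Cilium internals; the point is only that the scan cost grows with the number of services, while the map lookup stays constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rule { uint32_t vip; uint32_t backend; };

/* iptables-style matching: walk the rule list until the service
 * IP matches. Cost is proportional to the number of services. */
static uint32_t match_linear(const struct rule *rules, size_t n,
                             uint32_t vip, size_t *steps)
{
    for (size_t i = 0; i < n; i++) {
        ++*steps;
        if (rules[i].vip == vip)
            return rules[i].backend;
    }
    return 0; /* no match */
}

/* BPF-map-style matching: one hash lookup, independent of how
 * many services are installed. */
#define BUCKETS 1024
static uint32_t match_map(const struct rule *table,
                          uint32_t vip, size_t *steps)
{
    ++*steps;
    const struct rule *r = &table[vip % BUCKETS];
    return r->vip == vip ? r->backend : 0;
}
```

With a hundred (or 2,700) services, the linear walk does that many comparisons in the worst case, while the map version still does one.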
There's a BPF loader in user space, for example iproute2, or there's libbpf, which is maintained in the Linux kernel tree as well. It basically creates the maps, links the maps into the BPF program, and in the end calls a syscall to load the program. Then in the kernel it goes through a verifier, which makes sure that everything is safe and sound, and then it hits the just-in-time compiler, which translates the BPF instructions into native CPU instructions; basically all major kernel architectures have support for this these days, and it creates native code. Your user space agent can access BPF maps to update state at runtime, or to fetch state from the BPF program if it wrote into maps. Then, when traffic comes in, it hits the BPF code, and it can either redirect it to another device or drop the packet; that's also possible. The agent itself can also periodically update the BPF programs. So yeah, that's the basic workflow there. And why is BPF a radical shift? Because you get full programmability. It allows the user to tinker with and change the kernel data path in any way they want, but still with the safety belt on; it's not like kernel modules, where you can potentially crash the kernel.
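The split just described, a user-space agent and an in-kernel datapath sharing state through a map, can be modeled roughly like this. These are plain-C stand-ins of my own, not the bpf() syscall API: map_update plays the agent's role and datapath_verdict the BPF program's. The point is that updating the map changes the datapath's behavior immediately, with no reload.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a BPF hash map shared between a user-space
 * agent (writer) and the in-kernel datapath (reader). */
#define SLOTS 256
static uint32_t policy_map[SLOTS]; /* key: ip % SLOTS, value: verdict */

enum verdict { VERDICT_PASS = 0, VERDICT_DROP = 1 };

/* Agent side: the equivalent in spirit of a map-update call
 * through the bpf() syscall. */
static void map_update(uint32_t ip, uint32_t verdict)
{
    policy_map[ip % SLOTS] = verdict;
}

/* Datapath side: what the attached BPF program would do per packet. */
static enum verdict datapath_verdict(uint32_t ip)
{
    return (enum verdict)policy_map[ip % SLOTS];
}
```

The very next "packet" observes the agent's update; in the real thing, map updates are likewise visible to the running program without detaching or recompiling it.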
So I've been talking a lot about networking, and this whole talk is actually about container networking, but there are also different areas, like tracing and security. On tracing you've probably heard a lot from Brendan Gregg, who's doing great work there, and on security there are recent proposals to make the Linux Security Modules programmable with BPF as well. For networking, it allows you to fully define the forwarding pipeline, which is great. And the thing is, BPF has a stable API guarantee, so it's pretty much similar to syscalls, where we don't break it: once it works and your program passes the verifier, it keeps working. The other thing is you get native speed, so it's similar to kernel modules there. You can update your BPF programs live at runtime, without having to reboot anything, without even having to tear down services, so it works without service disruption, and it's really designed for performance and production use cases. BPF is developed mainly by Facebook, some folks from our side, Google, Red Hat, Cloudflare, many, many people, and it's really great that there are so many contributors; there's a really long tail of 287 contributors in the kernel since the beginning. There are also large-scale users. Facebook is running it in production in all of their infrastructure; whenever you hit facebook.com, you will go through BPF. Netflix is using it at large scale for tracing. Google is using it for traffic shaping to optimize the TCP stack, and many others. The Cloudflare folks are using it for their load balancer and DDoS protection in production. And it's even available on RHEL kernels, which means something, because it probably means it's mainstream now if people are backporting it to RHEL, which is a good thing. Even the old iptables maintainer, Rusty Russell, admits that iptables used to be good enough, but these days it needs a radical shift, and that's basically BPF, right?
So how does this all link back to Kubernetes networking? There's Cilium, which is an open source project. It's a CNI for Kubernetes, and it implements its full data path in BPF. It supports the Kubernetes service implementation, which is the focus of this talk, but there are also many other things: network policies, multi-cluster connectivity, encryption, and so on and so forth, so you can look it up; it's all open source on GitHub. So how did we get to the point in Cilium of replacing kube-proxy with BPF, where you can end up deleting the kube-proxy DaemonSet and don't even need your long list of iptables rules anymore?

To give you an overview of what kube-proxy does, there are different service types. One is called ClusterIP; ClusterIP gives you in-cluster access to a virtual IP, which you can access either from a pod or from the host namespace. There are services called NodePort, which allow access from outside, but at the same time also from inside. You can view it as an onion; for example, NodePort also makes use of ClusterIP. There's also ExternalIP, which means you have access via an external IP that you can back with your own backends; it again works on top of NodePort. And last but not least there's the LoadBalancer service, which you usually have with an external cloud provider, and it redirects packets to either your ExternalIP or NodePort services, so you can access it from the outside world, right?
In Cilium, in prior versions, we implemented the whole thing in BPF the following way. For ClusterIP with pod-to-pod connectivity: if you connect to a virtual IP, going back to the 3.3.3.3 example from the previous slides with the nginx service, you have your pod here, which has its own network namespace, and you connect through veth devices. On this side you hit the TC ingress path in the host namespace, and that's where the BPF program is attached. What it does inside the BPF program is a service map lookup: it checks whether there's an endpoint for that address, then does the DNAT to the backend in BPF, and creates a connection tracking entry, so that it can make sure the replies coming back in are translated back correctly. Here you can see the contents of a service map; it has the different backends encoded. And with the connection tracker we can make sure that whenever replies come back in, they hit the connection tracking entry, get the reverse NAT, and go back to the original client. So yeah, that's the basic path. How is this all plumbed down from Kubernetes to Cilium? Cilium listens to events from the kube-apiserver, so whenever there are service updates or backend updates, it picks them up and pushes them down into the BPF service map that is sitting in the data path, right?

More recently, we reworked the ClusterIP access with something even more efficient. The Linux kernel has support for attaching BPF programs to cgroup v2, and there are different ways you can attach. Just to walk you through it: if a client again connects to an nginx service, we are already attached to the connect call, and we can see that the client is trying to access one of the service IPs.
We can already do the map lookup right from there, and then we can rewrite the destination address to one of the backends. That's much more efficient, because you don't have to do the actual DNAT, you don't have to mangle packets in the data path; you can just rewrite the address that was passed down with the connect system call in the kernel. So it basically looks as if the application directly connects to the backend, but the kernel still makes it think that it's connecting to the service IP. The same works for connected UDP: there's connected UDP and unconnected UDP in the kernel, and you can do the same there from the connect hook. If you have unconnected UDP, where you only use sendmsg and recvmsg, you don't have any state, and there you need a small translation table for those two hooks, so that when a packet goes back up to the application through recvmsg, you can rewrite the backend IP back to the virtual service IP, right? So I think that's a really powerful feature, and it saves overhead.

Then there's the case of NodePort services. Here we have an external client that tries to connect to a service on the node's IP, within the NodePort range, and the packet arrives at the BPF program sitting on the physical device.
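The socket-level translation described above can be sketched as follows. This is a toy model in plain C with made-up structures, not the cgroup/connect BPF hook itself: hook_connect stands in for the program that inspects the address the application passed to connect() and rewrites it to a backend before any packet exists.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for a socket address and a service map entry. */
struct dest { uint32_t addr; uint16_t port; };
struct svc  { uint32_t vip; uint16_t vport;
              uint32_t backend; uint16_t bport; };

/* Toy model of the connect-time hook: if the destination is a known
 * service VIP, rewrite it to a backend right here, so no per-packet
 * DNAT is needed later. Returns 1 if a rewrite happened. */
static int hook_connect(struct dest *dst, const struct svc *svcs, int n)
{
    for (int i = 0; i < n; i++) {
        if (dst->addr == svcs[i].vip && dst->port == svcs[i].vport) {
            dst->addr = svcs[i].backend;
            dst->port = svcs[i].bport;
            return 1;
        }
    }
    return 0; /* not a service address: connect proceeds unchanged */
}
```

The application still believes it connected to the service IP; only the address the kernel actually uses has changed, which is why the data path afterwards carries no NAT overhead at all.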
It does the service lookup and then automatically does a DNAT to one of the backends. Cilium has the knowledge of whether the backend is local or remote, and if it's local, it just redirects into the pod's network namespace. That's the trivial case, where the backend is local to the node. In the more complicated case where it's remote, you again connect to the same node IP address where the BPF program is running, and it determines that the backend is not local and it has to go to the remote node. So it first does the DNAT and then also an SNAT, and that SNAT is also done in BPF; we basically wrote a NAT engine in BPF that does the translation. When the reply comes back, it has to do the reverse SNAT and DNAT and push the packet back out. That works, we have it merged in our Cilium code base, and it can go over direct routing or over a tunnel mesh in Cilium. So yeah, all done in BPF.

Then there's something called externalTrafficPolicy=Local in Kubernetes, where you say that you can only connect to those backends that are actually local to the node, so you don't take this extra hop; the packet would simply be dropped in this case if it tries to connect there. And something we merged just recently in Cilium is direct server return, which saves the overhead of doing the SNAT. Whenever a client connects to a service whose backend is remote to the node, we encode the service IP, so that the backend can then, in BPF, rewrite the packet with the original service IP as the source and reply directly to the client, without going back to the other node for this extra hop. So that's even more efficient, and it all happens inside BPF, without any of the other dependencies needed. So we did some performance
benchmarking on AWS with ENI networking and compared the whole thing to iptables and IPVS. We were running netperf TCP_CRR. TCP_CRR is a netperf test where you establish the TCP handshake, exchange one byte of information, and then already tear down the TCP connection, and it does this as many times as it can. We were running this against a NodePort service, and as you can see here, with the default baked-in kube-proxy implementation you really hit the limits if you have 2,700 services. The overhead is really noticeable, because you have to traverse all the lists in iptables, and it's being done for all new connections, right?

The second benchmark is TCP_RR, which means it creates a TCP connection and then only sends one byte of data back and forth as a ping-pong, without tearing the connection down, so you don't have a new connection every time. You can see here that iptables is not performing as badly as before, but BPF is still the clear winner, and the DSR implementation, where you save this extra hop, gives you the lowest latency.

The other thing we're currently working on right now is to move this extra hop that we had, where the backend is remote to the node, into the XDP layer. XDP is basically another hook for BPF in the system where you can run BPF programs right at the driver layer, without even having the overhead of allocating a socket buffer in the Linux kernel, without even having to go through the GRO engine and the higher layers. So it runs at a much earlier hook than TC ingress, and if we determine there that the backend is remote, we can already push it back out of the driver again without having to go into the higher layers. And all three major cloud providers finally have support now.
The last thing that got merged into the kernel was XDP support for the ENA driver, so you can run it also on AWS with ENI networking. This chart here is from a Netdev conference presentation by the Facebook folks. They used IPVS for a long time for their main L4 load balancing, and they switched all their front-end load balancers to BPF and XDP in production, and they see much better throughput. They don't publish concrete numbers, but they said it's a performance gain of 10x and more, which is quite impressive. So for the Cilium 1.8 release we have scheduled porting this into XDP.

To recap: with the whole data path in BPF you get much better performance and lower latency. You get faster service updates, because you don't have to go through netlink and replace the entire blob of rules; you just do a BPF system call to update the BPF maps. It's much more reliable.
You have fewer lines of code overall, and you don't need to wait for a new kernel: if you have a fix that you desperately need in production, you can just patch it live, because you can change the BPF program on the fly. You get much better visibility. I didn't really talk about that in this talk, but there's the perf ring buffer output, where you can export custom data to a high-performance perf ring buffer and collect all of this in user space. It's not subject to any kernel UAPI restrictions or anything; you can really define the custom structs in C that you want to push down there, and we can also correlate traffic with containers, for example. You don't have to shell out to iptables, and you're able to customize your data path much more; you can change the behavior on the fly, which is great, and it's fully integrated with the rest of Cilium. This talk only covered the load balancing integration, but there's also policy and much more in Cilium.

So yeah, with that said, if you want to try out the kube-proxy-free mode, there's a how-to; it's just three or four commands to run it. There's a kubeadm integration where you can deploy Kubernetes without kube-proxy, and then you can run Cilium and enable the kube-proxy-free feature. The code is all on GitHub if you want to check it out, and there's also a Slack community. So yeah, thanks for that. Do you have questions?

Are there any questions?

[Audience:] Sorry, privileged mode: does it require privileged mode?

So the question is whether it requires privileged mode. Yeah, most of the BPF features require privileges. The Cilium pod that you deploy basically runs in the host namespace as well, and it has to install and manage all the BPF, so that part is privileged. We disable unprivileged BPF by default.
Other questions?

[Audience:] It's the first time I'm seeing BPF. If I understood well, it's all in user space, and I'm concerned about security, because you never talked about security. Is it possible for a container to tamper with the rules or something similar?

So, all the BPF programs are loaded into the kernel. They have to pass a verifier, and the verifier does some strict security checks: it checks, for example, whether all the types are sane and whether you don't do out-of-bounds access, so basically everything to ensure that you cannot crash or destabilize the kernel. It even goes so far (you've probably heard of all the Spectre stuff) as to mitigate potential Spectre issues by rewriting parts of the loaded BPF programs while keeping the logic the same. So it's really doing a lot of work there to make sure everything is sound and safe.

Other questions? While you're walking, if you want some stickers, there are also some BPF stickers.

[Audience:] Have you done any performance comparison with the kube-proxy-free setup when you have a small number of services but the number of cluster nodes increases? Is there a performance gain there?

So, when you have a small number of services but the number of cluster nodes increases, there's still a performance gain, because the iptables hook comes at a much later point in the networking stack compared to TC ingress; TC ingress basically runs right after GRO is executed, so it's still earlier. Also, all the BPF code is just-in-time compiled, and it tries to avoid indirect calls as much as possible, as opposed to iptables. So there's still better performance and less overhead in the fast path.

Okay, we're out of time. Thank you very much. Thank you.