We're going to continue with the talks now. Thomas Graf is going to talk to you about Cilium. You heard Ben just previously talking a little bit, so now we're going to take it a bit further.

Thanks a lot. This is probably going to be the first time I actually sound like Darth Vader while talking. Before I get started, I want to shout out and thank Dave, Thomas and Charles for organizing this. I think it's an amazing lineup of speakers, and I'm always impressed by how many people actually show up for talks here. So thanks a lot to them. Yeah, let's give them a round of applause.

So what I want to talk about today is Cilium, which is an implementation of container networking, or networking in general, that leverages BPF, the technology Daniel just talked about. Before I dig into the details, I want to give some insight into how we see the future of networking. I'm sure all of you are familiar with SDN and how we got to SDN, but what could be the next step? Is SDN the ideal model, in particular for software networking, networking done on the server? I personally believe that the existing flow-table-based model works very well for hardware and works well for virtual switches, but it's not necessarily the best model when we have programming languages, JIT compilers, and full flexibility. On a CPU we can do anything. Why would we limit our software to technology that was originally invented for fixed ASICs?

So in the future of networking, I see BPF programs being generated for every individual endpoint, and an endpoint could be a Linux task, a cgroup, a VM, a container, a pod, whatever. Anything that communicates using packets can be an endpoint. And when I say we generate a program, what I mean is that we generate a program that does all of the networking. (Do you guys hear this echo as well? Otherwise I'll just do it without the mic.) All right, so we generate the BPF program that will do all of the networking and security for that particular endpoint.

What does that mean? If a container is sending a packet, we know the MAC address of the container. We know that every packet leaving that container needs to have that fixed MAC address. So instead of configuring this somehow, we provide a program with an if statement that says: if the source MAC is not this specific MAC, drop the packet, and so on for everything. So instead of flow tables, it's actual programs that we leverage.

This slide shows the comparison. On the left you can see a flow table, an OVS flow table example. You either have fixed static flows or wildcard flows, which describe a packet, a set of packets, or one particular flow, plus a set of actions executed for that flow. On the right you see the opposite, a program, and the program defines the behavior for every packet that flows through it. This example shows the if statement that says every MAC address needs to be the MAC address of the container, otherwise the packet cannot leave the container. The second part does an L3 lookup through a hash table and then performs an L3 action. These examples are just to make sure you understand the two different models.

To recap, this is basically what Daniel talked about: how you can attach these BPF programs and actually apply them to networking.
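To make the program side of that comparison concrete, here is a minimal hand-written sketch of the kind of per-endpoint program described above, with the source-MAC check and the L3 hash table lookup. This is not Cilium's actual generated code; the map name and the endpoint's MAC constant are made up for illustration.

```c
/* Sketch of a generated per-endpoint program (hypothetical names). */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* The endpoint's MAC is baked in as a constant at code-generation time. */
static const unsigned char EP_MAC[ETH_ALEN] = {0x02, 0x00, 0x00, 0x00, 0x00, 0x01};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);    /* destination IPv4 address */
    __type(value, __u32);  /* target ifindex */
    __uint(max_entries, 1024);
} l3_fib SEC(".maps");

SEC("tc")
int from_endpoint(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    if ((void *)(eth + 1) > data_end)
        return TC_ACT_SHOT;

    /* "If the source MAC is not this endpoint's MAC, drop." */
    for (int i = 0; i < ETH_ALEN; i++)
        if (eth->h_source[i] != EP_MAC[i])
            return TC_ACT_SHOT;

    /* This endpoint only ever talks IPv4, so nothing else is compiled in. */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_SHOT;

    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_SHOT;

    /* L3 lookup through a hash table, then the L3 action. */
    __u32 *ifindex = bpf_map_lookup_elem(&l3_fib, &ip->daddr);
    if (!ifindex)
        return TC_ACT_SHOT;
    return bpf_redirect(*ifindex, 0);
}

char _license[] SEC("license") = "GPL";
```

Note how everything the endpoint needs is either a constant or a map the program owns; anything the endpoint doesn't need simply never appears in the program.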
Daniel already talked about the toolchain on top: you write pseudo-C code and use the LLVM BPF backend to translate that pseudo-C code to BPF bytecode. You can then take that bytecode and load it into the kernel. The verifier ensures that you're not crashing the kernel, and you can then attach the program to various hooking points inside the kernel. The ones we currently leverage are TC ingress and egress and XDP. So we get the capability of passing all packets that leave a container, go into a container, or that we see on the wire on top of a VXLAN encapsulation device or something like that; we can feed all of them through BPF programs.

So this is the one-BPF-program-per-endpoint visualization. The key element is that it's not a single BPF program; it's one BPF program per endpoint, which means each program only contains what is essentially required for that endpoint. To give you a couple of examples: if an endpoint will only ever talk IPv4, we will definitely not compile any IPv6 support into that program. If an endpoint requires port mapping, for example a task running on port 8080 that wants to expose itself on port 80, we will compile in the instructions necessary to do the port translation. If that is not needed, we simply omit it. Another example: if we require policy enforcement, we compile it in; if you don't need it, we compile it out. This means that instead of having a configurable pipeline, we have a programmable pipeline that always contains only the minimal amount of code required.

So this is the product of all of this. We've heard several DPDK talks, and I'm sure all of you are familiar with the Linux kernel and its networking capabilities. What this is trying to achieve is to find the sweet spot in the middle: the flexibility and extensibility of user-space networking, and also its performance, because we can write the program with the exact requirements that we have. This is where a lot of the DPDK interest comes from: you get to define exactly what your application is supposed to do with packets. At the same time, we don't want to leave the kernel. We want to stay within the kernel to leverage the hardware abstraction, the safety model the kernel provides, and the reliability of the kernel. The kernel provides the BPF verifier, which ensures a BPF program cannot crash the kernel, because the consequences can be very severe. If you have a bug in your protocol parser and it can be triggered remotely through a packet, a single packet could essentially take down your entire data center. That's definitely something you want to avoid at all costs, so some mechanism needs to be in place to guarantee safety.

And the last aspect which makes this very flexible is that these programs are not just generated once at start. They can be regenerated at any time as you see fit, and you can recompile and reattach them in the kernel without losing any state. This means that if something changes in your environment and you need to change the program, you can do so and automatically replace the program without losing any connections. One simple example: you have a load-balancing function with a particular backend selection that, say, uses the packet's hash. Now you want to switch over to least-connected.
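To sketch why that swap is safe, consider the following. All names here are hypothetical, not Cilium's generated code: the connection state lives in BPF maps, and the backend-selection function is the only piece that differs between the two program versions.

```c
/* Sketch only: illustrates code/state separation for the LB example. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define NR_BACKENDS 4

/* Established flows. This survives program replacement, because maps
 * are separate kernel objects that outlive the bytecode. */
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __type(key, __u64);    /* flow tuple hash */
    __type(value, __u32);  /* backend index pinned for this flow */
    __uint(max_entries, 65536);
} conn_state SEC(".maps");

/* Active connection count per backend, maintained elsewhere. */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __type(key, __u32);
    __type(value, __u32);
    __uint(max_entries, NR_BACKENDS);
} backend_load SEC(".maps");

/* Old program: backend picked from the packet hash. */
static __always_inline __u32 select_by_hash(__u64 flow_hash)
{
    return flow_hash % NR_BACKENDS;
}

/* New program: least-connected. Only this function changes between the
 * two generated programs; conn_state stays intact across the swap. */
static __always_inline __u32 select_least_connected(void)
{
    __u32 best = 0, best_load = (__u32)-1;

#pragma unroll
    for (__u32 i = 0; i < NR_BACKENDS; i++) {
        __u32 *load = bpf_map_lookup_elem(&backend_load, &i);

        if (load && *load < best_load) {
            best_load = *load;
            best = i;
        }
    }
    return best;
}
```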
You can regenerate the code, and the new program will only include the new backend selection; we automatically replace the program, and all the connection state is still in place. This is because code and state in BPF are separate, right? There's BPF bytecode and there are BPF maps, and the maps contain the state.

Another big benefit: we have been treating networking as networking only, right? Networking is supposed to happen on packets, otherwise it's not networking. But look at the various use cases of networking. One example is connectivity policy: A can talk to B, yes or no. Does it really make sense, if A wants to talk to B, to actually construct a packet just to drop it again because of policy? It doesn't make any sense. What we want instead is to deny the system call that is causing the packet to be constructed. So we can start thinking outside the box of traditional networking and start connecting concepts such as attaching BPF programs to system calls, to tracepoints, and so on, and actually provide the functionality the application developer wants at the right level, instead of trying to solve everything at the networking level.

Daniel already talked a bit about XDP. XDP gives us the ability to run BPF at the driver level, very close to the hardware, so we can achieve the wire speed that DPDK claims in the very same way. This is definitely interesting for things like load balancing, dropping packets early, and so on.

This one is probably the most obvious: if you can generate a program, you can turn a lot of configuration into constants. Things like the container's MAC address, the task's MAC address, the IP address, all of that just becomes constants in the program, and the compiler can optimize them. So literally, an IPv4 address will just be loaded into a register before it's written into the packet. There's no map lookup to find the IP address of the container first; it's in the program, and the compiler can optimize it heavily.

Another one is that we can pick the right data structure on the fly. What does that mean? It really matters which data structure is the best fit for your specific need. If you have service A and service B, or task A and task B, they talk to each other and B has a list of two allowed consumers, it actually makes sense to unroll that loop and encode the allowed consumers directly into the code. If you have 10,000 allowed consumers of that service, it doesn't make any sense to unroll the loop; then you want a hash table. If you have a data structure that's collecting data, let's say statistical data, then you're writing into that hash table a lot, so you definitely want per-CPU hash tables. Based on these needs, we can pick the right data structure at code-generation time.

As I said, data or state is decoupled from code, which means that if your program is collecting state or statistics, you can replace the program without losing state or data. That means we can, for example, upgrade a data path or add support for an additional protocol without a single connection getting dropped.

Collect your own statistics: we've had huge discussions over the last couple of years in the kernel community about what types of TCP metrics should be collected. This is a huge discussion because every single additional statistic collected has a performance impact for everybody.
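With generated BPF programs, an endpoint-specific collector is just a map that the generated program writes into. As a rough illustration of what such a self-defined collector could look like, here is a per-CPU hash map keyed by a metric id; every name here is made up for the sketch.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Metric ids this endpoint happens to care about; purely illustrative. */
enum { METRIC_RX_DROPS, METRIC_TX_RETRANS, METRIC_MAX };

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __type(key, __u32);    /* metric id */
    __type(value, __u64);  /* counter */
    __uint(max_entries, METRIC_MAX);
} ep_metrics SEC(".maps");

static __always_inline void count_metric(__u32 id)
{
    __u64 *val = bpf_map_lookup_elem(&ep_metrics, &id);

    if (val) {
        (*val)++;   /* per-CPU slot: no atomics, no cross-CPU contention */
    } else {
        __u64 one = 1;
        bpf_map_update_elem(&ep_metrics, &id, &one, BPF_ANY);
    }
}
```

User space would sum the per-CPU values when reading the map, so the hot path stays contention-free.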
BPF and Cilium allow you to define your own statistics collectors as you see fit, because you are the only one who actually pays the penalty for them; you're not imposing the performance penalty on everybody else. So whatever BPF is capable of matching, which is essentially everything, you can collect statistics on.

This is the Cilium architecture itself. I talked about the benefits of generating BPF programs and so on. Cilium itself is decoupled into two components. One is the BPF programs, which are loaded into the kernel; that's the actual data path, where the packets flow through. And then there is an agent written in Go, which compiles the programs and injects them. The agent receives events from various orchestration systems. This could be Docker, the local Docker runtime, this could be Kubernetes, this could be Mesos, and so on. As it receives these events, it generates programs as required. So if a local container is getting started, we receive a notification, generate the program, compile it, and attach it.

We also have several other components which sit on top. The most interesting one is probably the monitoring component, which is built on top of the perf ring buffer. It's a very fast shared-memory-based ring buffer, which allows us, for example, to send notifications whenever a packet is being dropped. The ring buffer is extremely fast; it can literally support millions of drop notifications per second. So you can monitor your network and policy violations and so on at high networking speeds. And as Daniel mentioned, the structure of this ring buffer is up to us to define, so it's up to us to define the actual metadata that is provided to you. The current implementation, for example, includes the following metadata when we drop a packet: the ID of the container that sent the packet, all the labels attached to it, the container that was receiving it, the packet length, and the first 64 bytes of the packet. All this information is not visible if you use something like tcpdump or the kfree_skb drop monitor, things like that. We can provide a lot more metadata to help you debug and troubleshoot your network.

Cilium takes advantage of BPF in two ways. One, I talked a lot about the performance benefits, security benefits, and so on. Two, we use the flexibility to drastically simplify the networking model; we basically throw out most of the networking principles that are out there. It's a single flat L3 network: there are no subnets, nothing. Potentially every endpoint can talk to every other endpoint, as long as the security policy allows it. You don't need to run BGP or anything like that; it's completely flat, right? A lot of people claim that running an L3 network is simple. I'm pretty sure those people have never run large BGP networks; it's actually not trivial at all, right? We support IPv6, IPv4, and NAT46, with which we can translate between IPv4 and IPv6. The address family just doesn't matter anymore; you can use whatever addressing you want. The application developer running containers or whatever will not have to care about addressing.

Identity-based security: this basically means decoupling addressing from security.
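Before going into the details, here is a minimal sketch of what the receive-side decision could look like. The numeric identity is assumed to arrive with the packet (for example in a tunnel header field, as discussed in the Q&A below), and the struct layout and map names are hypothetical.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct policy_key {
    __u32 src_identity;   /* who is sending, derived from labels */
    __u16 dport;          /* destination port, 0 = any */
    __u16 pad;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, struct policy_key);
    __type(value, __u8);  /* presence of an entry means "allowed" */
    __uint(max_entries, 16384);
} policy SEC(".maps");

/* One O(1) lookup decides delivery; the cost is independent of how
 * many endpoints exist or how many rules are loaded. */
static __always_inline int policy_allows(__u32 src_identity, __u16 dport)
{
    struct policy_key key = {
        .src_identity = src_identity,
        .dport = dport,
    };

    if (bpf_map_lookup_elem(&policy, &key))
        return TC_ACT_OK;
    return TC_ACT_SHOT;
}
```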
Instead of having iptables or netfilter ACLs which match on IPs and ports, what we do instead is derive the identity of the packet based on, for example, container labels, pod labels, the cgroup, whatever. We give each endpoint an identity and then attach that identity to the packet so it gets carried over the network, which means we can enforce security policies decoupled from addressing. This scales extremely well, because an identity is just a number. On the receiving side, it's literally just a hash table lookup that decides whether a packet is supposed to be delivered or not, which means the cost of policy enforcement is the same whether you're running one endpoint or 100,000 endpoints, and whether you have one policy rule loaded or 10 million policy rules loaded. All the complexity is in the identity generation.

We can do transparent service routing, which means we can redirect all packets that go in and out of an endpoint through a user-space proxy, which can do service routing, service throttling, API throttling, circuit breaking, and so on. This is basically injecting an Envoy proxy into every endpoint.

We can do transparent end-to-end encryption, which means that if you have applications which have not implemented TLS, we can do the end-to-end encryption for you. Anybody between your server and the server that is actually receiving the traffic will only see encrypted packets. The packet only travels unencrypted from your task into Cilium, and it is decrypted transparently again on the other side.

This slide shows a policy example for a very simple application: a load balancer, front ends, and back ends. We attach a label to each of them and allow a user to talk to the load balancer, the load balancer to talk to the front end, and the front end to talk to the back end. Very simple. Each of these gets an identity, and we enforce the policy on those identities. On top of that, we can also add environments, which means you can take that policy and apply it to, let's say, a production environment and a QA environment. The way to do that is very simple. You can say: any endpoint with the label "production" can only be consumed by endpoints that also carry the label "production"; any endpoint with the label "QA" can only be consumed by endpoints that also carry the label "QA". This is something that is very simple to understand; you can build very simple constraints with tools and metadata that everybody understands. In the back end, all of this gets translated into identities that we can then enforce.

The main question I think everybody might be wondering about, given that we've had a couple of DPDK talks: is the kernel actually fast enough to do all of this? This is kind of a provocative slide, I would say, because I'm probably doing the same thing a lot of the DPDK slides do. It shows one very extreme end of the spectrum: the TCP stack performing endpoint-to-endpoint on a single node. There's no hardware involved at all, and this is using GSO and GRO heavily, but it's trying to make the point that the TCP stack in the kernel is not necessarily slow. It depends on what you look at. If you absolutely need to see 64-byte packets on the wire, then the kernel is simply not optimized to handle that. If you care about throughput, if you care about TCP performance in general, the kernel is very well capable of handling that; we can do 70 gigabits of TCP traffic on a single core.
So it's not about in-kernel versus out-of-kernel. It's about what kind of programming technology you use to process your packets, and whether you process every packet individually or apply something like GRO to build a giant frame that you can then pass through the stack at once.

And with that, I want to leave a lot of time for questions. This slide points to the GitHub repo. We just got started a couple of months ago, so this is still very fresh. Please also feel free to reach out to me on Twitter, and we maintain a Slack channel as well. One last thing: we do have stickers. I will leave them on the table, because there are talks afterwards as well. Feel free to grab a Cilium sticker on your way down. With that, I think we have about 10 minutes for questions.

You made a statement earlier, on one of the slides, that you actually insert the identity into the packet. Could you elaborate a bit more on where in the packet you're putting it?

That's always the first question. BPF has all the flexibility, so we don't care. Right now, the obvious places would be an encapsulation protocol, like the VXLAN tunnel ID or a Geneve TLV. You could also put it into the IPv6 flow label. Maybe you own your protocols and have spare bits somewhere; maybe you're running QUIC on top of UDP. We don't care. BPF gives you the ability and the flexibility to inject it anywhere; we just need to know where those bits are stored. Those bits could even be set by another framework, say a DPDK application, and we can read them. As long as we know where the bits are set, we can use them.

Other questions? Yeah. Well, you have to start with something, right? So, do you mean BPF is the future of networking? And should we have BPF in DPDK?

I think, so when I say this is the future, I mean the model is the future. Whether BPF as a technology, as it stands right now, is the future, I don't know; we'll see. I see code generation, writing specific programs that do networking and security for a particular endpoint, as a revolutionary step on top of SDN, or on top of flow-table-based SDN, I would say. And to answer the second question, should we add BPF to DPDK: absolutely. There is a classic BPF user-space runtime already. We should definitely have an eBPF one that runs on top of DPDK, so we get BPF in user space as well. Vincent?

Back to OVS and eBPF: there were some prototypes as well, as we know, and I think there is a plan for more integration. Would it make sense for Cilium and OVS to converge a bit?

So the question is: what about the OVS project that is on its way to actually leveraging BPF? As far as I know, what they're looking at right now is building a fully compatible replacement for the existing kernel data path, which means the BPF program will actually parse netlink messages and accept existing data path flows. Cilium, on the other hand, is not adopting the flow table model at all and doesn't even have a concept of a flow. So I think there is little overlap, and converging wouldn't make sense unless one of us refocused on something other than the flow table model. Go ahead.

Some of the other user-space uses I've seen for eBPF have been, let's say, having a P4 front end and then a BPF back end.
But that very closely follows the paradigm you described earlier: you have flows; it's the same flow-based model. What you're able to do here is something very different: you hook into a system call, or you could hook anywhere in the stack. But that means the existing set of tools can't talk to something that looks like this. How do you address the whole chain from, let's say, orchestration down to the SDN controller, down to the actual platform, when Cilium looks so different?

Am I supposed to repeat that question? No? I think it's an excellent question. The question is: there are projects out there that let you write P4 code and translate it into BPF, which is then run; BPF has the flexibility to be a back end for P4. And the second part of the question: Cilium is aiming for something that is not in line with P4, in the sense that it's not a pipeline-based flow model and does not have a configuration mode and then a runtime mode. How does Cilium cope with the lack of tooling, given that existing tooling has been built for flow-based models? Frankly, we just have to create new tooling. But this is a good opportunity, because a lot of that tooling was written for models that have been around for 20 or 30 years. It's not necessarily the tooling you need in a cloud environment or a DevOps environment, where, frankly, a lot of people don't even have the capacity to learn all of networking; the skill set required is very wide. If you can abstract that away and provide tooling that is more specific, more abstract, and still sufficient, I think that's the way to go. An application developer wants to see a packet drop notification from this container to this container, or from this service to this service. An IP address tells the application developer nothing, because they don't even know which service that IP address belongs to.

The question is whether Cilium is capable of adding or manipulating, for example, NSH headers, right? Yep. BPF can mangle any packet data, and it can also extend or modify the packet size, so it is capable. As I mentioned in the talk, I have actually implemented full NAT46 in BPF, which obviously also requires rewriting the entire header and changing the packet size. So absolutely, yes, it is capable of manipulating NSH headers. I want to point out one limitation, though: you cannot have a plain loop in BPF. You need to unroll the loop, because a loop that never breaks out could hang your kernel, right? So on that side you need to be a bit more creative; you can't just code it naively. But yes, absolutely, BPF is not specific to any protocol at all; it's completely generic.

We have two minutes, so I think there's time for one more question if there is one. Otherwise, thanks a lot, guys. Be sure to grab a BPF or Cilium sticker; you'll be among the first to get one.