So, shall we start, or shall we wait for one or two minutes? Okay. Hello everyone, welcome to this talk. The talk's name is pretty long, so let me read it here: Kubernetes network policy enforcement in XDP, without IP translation. Let's look at the keywords. Kubernetes is the environment our system is running on. Network policy is the project target we are designing and implementing. XDP is the major technology we are using. "Without IP translation" is an optimization we have done. My purpose in this talk is to let you know, through these keywords, what we have done and what we have achieved. The core part is optimization: we improve from an O(N) solution to an O(1) solution, and we further improve the O(1) solution so that less calculation is needed. I hope that after this talk, you will have a clear understanding of the optimizations we have done.

This is me, Hongqiang. I'm a software engineer at Futurewei Cloud Lab. There is a lot of interesting work in the Cloud Lab. We work on Kubernetes. We have a project named Arktos, which improves Kubernetes performance and also adds the tenant concept. We also have a project named Mizar, which is a networking provider that provides a high-scale and high-performance network for Kubernetes. These are research works with challenges, and we would like more people to join, either by joining our open source projects or by joining our company. This is the other speaker, Xiaolin Ding. He is not here today, but he will try to answer questions in the question and answer session. He was the manager for this project, and the ideas in this talk are mostly from him. He is a nice and kind manager.

Okay. In today's talk, first we will have an overview of Kubernetes networking and network policy, so that we have a brief understanding of them. Then we will introduce policy enforcement approaches. Traditionally, network policy is enforced by iptables; our project didn't implement it that way. Our first iteration enforces network policy with XDP. Compared to iptables, it's an improvement from O(N) to O(1). I need to mention that the XDP technology itself doesn't bring O(1) efficiency naturally; we made a special design to achieve that, and I will introduce it later. We then did a further improvement, which I call "without IP translation" here; with it, much less calculation happens in a production environment. So in this talk, I plan to introduce two improvements: the improvement from O(N) to O(1), and the improvement with less calculation. After that, if there is time, I will go deep into some details. Finally, we will have a question and answer session.

Here are some key points about Kubernetes networking. In the cluster, each pod has one and only one IP. A pod can reach other pods through IP. By default, all the pods in the cluster are connected to each other, which is to say a pod can reach all the other pods and can be reached by all the other pods. Kubernetes defines this networking architecture, but who is the real one to connect those pods? It's the CNI, which stands for Container Network Interface. I personally prefer to call it a network provider, which provides network functions to Kubernetes. Kubernetes just defines the network behaviors, and such behaviors shall be implemented by network providers. There are many network providers: Cilium, Calico, Flannel, and Mizar. Mizar is the project I'm working on. We just said that by default all the pods are connected to each other; in many scenarios, that's not desired.
In a production environment, a database should only be connected to by an API server, and the API server should only be connected to by the front-end server. If all the pods are connected without any restriction, that is insecure. So Kubernetes defines an object named network policy to add restrictions between pods. You can consider a network policy to be an object that contains rules about whether some pods can or cannot connect to other pods. This is a link to the network policy documentation. In the documentation, there is an example network policy definition, which I copied here on the left side. Please note, Kubernetes just defines the network policy format and definition; the real work of connecting or disconnecting pods is done by network providers.

Let's take a look at this network policy example. This is the policy name. This is the pod selector part, which defines which pods are affected by the policy. In your cluster, there may be tens or thousands of pods, and we don't want all of them subjected to one policy, so this pod selector defines which pods the policy affects. It uses a label selector; as we know, pods can be marked with labels. In this example, the label is role=db, so all the pods with label role=db will be affected by this example policy. How many pods will be affected? We don't know; it depends on how many pods have this label. It could be several pods, many pods, no pods, or all pods. Whether a pod is affected or not depends on its labels.

A network policy has two ends: where the traffic is from and where it goes. One end is a pod affected by the policy, as we just introduced. The other end can be identified by IP blocks, or a pod selector, or a namespace selector, et cetera. I won't spend much time explaining the policy definition; there are many details in it, so let me just use this example. Let's say that originally there is no policy in the cluster, so all the pods can connect to each other. Then we add this policy, and there are several pods with label role=db. Suddenly, these several pods become isolated: they are affected by the policy, while all the other pods are not affected and can still connect to each other.

Let's see this ingress rule. It says: when traffic reaches a pod with label role=db, if the traffic is from a pod with an IP in this CIDR range, the traffic shall be allowed; if it's not from this CIDR range, the traffic shall be denied. Here there is also a pod selector: if the traffic is from a pod with label role=frontend, the traffic shall be allowed, otherwise it shall be denied. It's similar for the namespace selector. Please note that the IP block, pod selector, and namespace selector are in an OR relationship: if any of them matches, the traffic will be allowed; if none of them matches, the traffic will be denied. I call this allow-or-deny decision a traffic judgment. The rule also defines ports here. In this example, only when the traffic is TCP and the destination port is 6379 will the traffic be allowed; otherwise it shall be denied. Please also note that these peer identifiers and the ports are in an AND relationship: the traffic can be allowed only when both the identifier and the port match. This is one example of a policy. You can imagine there will be many policies in the cluster. Each policy controls certain pods, but policy effects can overlap: a traffic flow may be denied by one policy but allowed by another. In such a case, the traffic shall be allowed.
So let's see how to enforce network policy rules with iptables. iptables is a chain of rules we can input into the system to define whether to accept or drop traffic when it matches certain conditions, so it seems a natural fit for network policy: one policy rule can be expressed as one iptables entry. This picture shows roughly what these rules look like in iptables. Obviously, this is an O(N) solution. If there are 100 policies in a cluster, we need to define 100 entries in iptables, and when judging a traffic flow, we have to go through all 100 entries before finally reaching a conclusion about whether to allow or deny the traffic. There is so much traffic happening all the time, and all of it needs to be judged. So it's an O(N) solution. Do we have a better solution than O(N)? Yes. In our XDP solution, we improved from O(N) to O(1), which is a huge improvement: for multiple policies, we only need one judgment. Let me introduce how that happens.

We are using eBPF and XDP technology. There is a lot of research and discussion on them. eBPF, essentially, means compiling some C code and putting it into the kernel; the code is then invoked when a certain condition triggers. It's a revolutionary technology. Why? Consider that many years ago, we were in an HTML-only world: web pages were static and could only show some information. Later we got JavaScript, and then we could program the web page and let it interact with users. eBPF is to the Linux kernel what JavaScript is to HTML; that's my understanding. When we do programming, we also need data to support the business logic, so we have databases. It's the same for programming in the kernel: we need an algorithm, and it runs on data. The XDP table is that data. We put the compiled code into the kernel, and we also input the data into the XDP table. At runtime, the compiled code looks up the XDP table, and the code logic then takes the desired route. So the algorithm and the data are the two key factors for kernel programming.

Our first iteration of the network policy work is what we call IP-based. Why is it IP-based? Let's see. A rule can use an IP block, a namespace selector, or a pod selector, and we combine all of them into IP blocks. In the cluster, we look up the namespace selector for project=myproject; let's say by this lookup we get the pod IP 172.1.0.5. Then we look up the pod selector for role=frontend and get the pod IP 172.2.0.8. This is an example. Now we translate the policy rule into this language: by default, all traffic shall be blocked. Then we allow traffic from three CIDRs. The first is 172.17.0.0/16; this comes from the IP block definition. Second, we allow traffic from 172.1.0.5/32; it comes from the namespace selector. Third, we allow traffic from 172.2.0.8/32; it comes from the pod selector. According to the policy, we also need to block traffic from 172.17.1.0/24; that comes from the except part. For ports, there is no change: by default we block all traffic, but we allow traffic that is TCP on port 6379.

On the previous page, we said the logic runs on the data. We organize the data into three tables. Table A we call allowed CIDRs; these are the entries in table A. Table B we call blocked CIDRs; this is the entry in table B. Table C we call allowed ports; this is the entry in table C. Let me emphasize here why we say it's IP-based: because we look up the namespace selector and the pod selector, and then combine all these results into IP CIDRs.
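To make the data layout concrete, here is a minimal sketch of how these three tables could be declared as eBPF maps in C. This is my illustration, not the actual Mizar code: the map names, key layouts, and sizes are assumptions, and it only relies on the standard libbpf BTF map-definition syntax.

```c
/* Keys and maps for the IP-based design, in libbpf's BTF map-definition style.
 * Tables A and B are longest-prefix-match (LPM) tries keyed by CIDR, and
 * table C is a hash keyed by protocol and port. The value is a 64-bit
 * policy bitmap; with a single policy, only bit 0 is used. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct cidr_key {
    __u32 prefixlen;   /* prefix length must come first in an LPM trie key */
    __u32 addr;        /* IPv4 address in network byte order */
};

struct port_key {
    __u8  protocol;    /* IPPROTO_TCP, IPPROTO_UDP, ... */
    __u8  pad;
    __u16 port;        /* e.g. 6379 */
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __uint(max_entries, 65536);
    __type(key, struct cidr_key);
    __type(value, __u64);              /* policy bitmap */
} allowed_cidrs SEC(".maps");          /* table A */

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __uint(max_entries, 65536);
    __type(key, struct cidr_key);
    __type(value, __u64);
} blocked_cidrs SEC(".maps");          /* table B */

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, struct port_key);
    __type(value, __u64);
} allowed_ports SEC(".maps");          /* table C */

char _license[] SEC("license") = "GPL";
```

An LPM trie map lets one lookup match the longest covering CIDR, which is what makes each per-table judgment a single operation.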
Now we have the data, and the algorithm is A AND NOT B AND C. When this expression's result is non-zero, it means the traffic shall be allowed. If the result is zero, it means the traffic shall be denied. Let's see some examples. For the first traffic flow, it matches the second entry in table A, and the port matches table C, so we allow the traffic. For the second flow, although it matches the third entry in table A, it cannot match table C, so we deny the traffic. For the third flow, it matches the first entry in table A and also matches table C, so we allow the traffic. For the last flow, although it matches the first entry in table A, it also matches table B; please note there is a NOT before B, so since it matches B, we need to deny the traffic. So this is the logic.

We just talked about the basic logic to handle traffic for one policy. We know there is not only one policy in the cluster; we expect there to be multiple policies, and we will handle them at the same time. First we number the policies: the first policy is 1, the second policy is 2, the third policy is 4. What does this really mean? For the first policy, its binary format is 0001. For the second policy, its binary format is 0010. For the third policy, its binary format is 0100. So what would the number 6 mean? Its binary format is 0110. In this example, the number 6 means it matches the second and the third policies, but it doesn't match the first policy. We can see that very clearly from the binary format: this is the first policy's bit, this is the second policy's bit, this is the third policy's bit. This is the core logic of our policy judgment.

Let's take a look at the XDP table's data. This is the key part, and this is the value part; now the value is a policy bitmap. It's a number, but its binary format indicates the policy matching state. For example, in the third entry, the key is 172.2.0.8 and the value is 9. This value means the first policy and the fourth policy match while the other policies do not. We still use the expression A AND NOT B AND C to evaluate the traffic, but now the result can indicate which policies allow the traffic. For this traffic, 172.7.2.10, TCP 6379: if we use the expression to calculate, we look up table A and the result is 1. We look up table B and there is no result, so the result for table B is 0; since there is a NOT before B, a 0 from table B means no policy blocks this traffic. We look up table C and the result is 1. So 1 AND NOT 0 AND 1 equals 1; the result is 1. For the second traffic flow, with the same approach the result is 9, so it is matched by policy 1 and policy 4. For the third traffic flow, it's TCP with port 8000; by looking up table C we get 0, so after this expression the traffic shall be denied. That's the approach for the IP-based policy.
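To make the bitmap arithmetic concrete, here is a small standalone C sketch that just evaluates A AND NOT B AND C on policy bitmaps for the three example flows above. The map lookups are replaced by hard-coded values; the first flow's values come from the talk, and the values for the other two flows are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

/* Evaluate the per-policy bitmap expression A AND NOT B AND C.
 * A non-zero result means at least one policy allows the traffic,
 * and each set bit tells which policy matched. */
static uint64_t judge(uint64_t a, uint64_t b, uint64_t c)
{
    return a & ~b & c;
}

int main(void)
{
    /* Lookup results for the three example flows in the walk-through.
     * The first flow's values are from the talk; the A and C values of
     * the other two flows are assumed for illustration. */
    struct { const char *flow; uint64_t a, b, c; } examples[] = {
        { "first flow, TCP/6379",  0x1, 0x0, 0x1 }, /* -> 0x1: allowed by policy 1         */
        { "second flow, TCP/6379", 0x9, 0x0, 0x9 }, /* -> 0x9: allowed by policies 1 and 4 */
        { "third flow, TCP/8000",  0x1, 0x0, 0x0 }, /* -> 0x0: denied, port not allowed    */
    };

    for (unsigned i = 0; i < sizeof(examples) / sizeof(examples[0]); i++) {
        uint64_t verdict = judge(examples[i].a, examples[i].b, examples[i].c);
        printf("%s -> bitmap 0x%llx (%s)\n", examples[i].flow,
               (unsigned long long)verdict, verdict ? "allow" : "deny");
    }
    return 0;
}
```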
Let's discuss the pros and cons of the IP-based policy. Compared to iptables, it improves efficiency from O(N) to O(1), because with only one match we get which policies allow the traffic, or we just deny it. For the huge amount of traffic judgment in the cluster, that's a huge win. The con is that the XDP table is not stable. We consider network policies to be stable: once a policy is configured, it seldom changes. But our IP-based policy translates pod label selectors into IPs. In this example, the pod selector cares about pods labeled role=frontend. When the policy is created, we look up the cluster, get the list of pods with label role=frontend, and put these IPs into the XDP table. But that's not enough, because pods can die and restart, and then there are IP changes. So we have to put a watcher in place to monitor pod changes. Every time a pod is created or updated, we check its labels, and if a label is related to the policy, we update the XDP table with the pod IP. In a big Kubernetes cluster, pod changes happen so frequently that even when there is no change to any policy, we have to continuously update the XDP table. So the IP-based policy is unstable.

There can be many reasons for pod changes. When requests increase, pods under the same label may be automatically scaled out, or they may be scaled down when requests decrease. In this scenario, XDP table updates are triggered. Another change source is that pods may crash from time to time. When a pod crashes, Kubernetes creates another pod as a substitution, and in this process the pod IP changes. Hence, an XDP table update is triggered.

To address this drawback of the IP-based policy, we developed a new iteration, which is the label-based policy. It addresses the issue well. Let me introduce it. Here is a brief overview of the approach. There are two parts: generating label-related data, and judging traffic at runtime. The data generation part happens when policies change, such as creating, updating, or deleting policies. It also happens when pods change, such as creating, updating, or deleting pods. The generated data is input into the XDP table for traffic judgment. The second part happens at traffic runtime. When a pod is sending out traffic, we put the pod's label info into the packet, so the label info travels together with the packet. When it arrives at the destination, the label info is retrieved from the packet. At this point, the eBPF program judges the traffic by comparing the label info against the XDP table.

Let's take a look at the label data. Let's say pod 1 has the label role=frontend, and pod 2 has the labels environment=prod and app=nginx. The generated pod label map looks like this: each label gets an ID. Role=frontend gets ID 1, app=nginx gets ID 2, environment=prod gets ID 3. Pod 2 has two labels, and the combination of the labels also gets an ID; in this example, it gets ID 4. Let's say we have two policies. Policy 1 applies to the label app=nginx, and policy 2 applies to the label environment=prod. This is the label policy map we generate. The left side is the key, which is the label ID. The right side is the value, which is a policy bitmap. For the first label, role=frontend, it's not in either of the policies, so the value is 0. For the second label, app=nginx, it only shows up in policy 1, so its value is 0001. For the third label, environment=prod, it only shows up in policy 2, so its value is 0010. The fourth one is easy to confuse: it contains two labels, app=nginx and environment=prod. The first label, app=nginx, is covered by policy 1, and the second label, environment=prod, matches policy 2. In other words, it can match both policies, so the value is 0011.

That's how the data looks. We generated the data for the pod label map and the label policy map. The pod label map is a mapping between a label and its ID. The label policy map is a mapping between a label ID and the matching policies. So what logic can be extracted from the two tables? They indicate the relationship between a pod and its matching policies.
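Here is a small standalone C sketch holding the two generated tables with the example data above. The struct layouts and names are illustrative only; the real control plane generates these maps dynamically from the cluster state.

```c
#include <stdint.h>
#include <stdio.h>

/* Pod label map: each distinct label, and each label combination carried by
 * a pod, gets an integer ID (generated by the control plane). */
struct label_entry {
    const char *labels;    /* label or label combination from the pod spec */
    uint32_t    id;
};

static const struct label_entry pod_label_map[] = {
    { "role=frontend",               1 },   /* pod 1's only label        */
    { "app=nginx",                   2 },
    { "environment=prod",            3 },
    { "app=nginx,environment=prod",  4 },   /* pod 2's label combination */
};

/* Label policy map: label ID -> bitmap of policies whose selectors match
 * that label. Bit 0 stands for policy 1, bit 1 for policy 2, and so on. */
struct label_policy_entry {
    uint32_t label_id;
    uint64_t policy_bitmap;
};

static const struct label_policy_entry label_policy_map[] = {
    { 1, 0x0 },   /* role=frontend:    matched by no policy         */
    { 2, 0x1 },   /* app=nginx:        matched by policy 1 (0b0001) */
    { 3, 0x2 },   /* environment=prod: matched by policy 2 (0b0010) */
    { 4, 0x3 },   /* the combination:  matched by both     (0b0011) */
};

int main(void)
{
    /* Print both generated tables, mirroring the example in the talk. */
    for (unsigned i = 0; i < sizeof(pod_label_map) / sizeof(pod_label_map[0]); i++)
        printf("label ID %u <- %s\n", pod_label_map[i].id, pod_label_map[i].labels);
    for (unsigned i = 0; i < sizeof(label_policy_map) / sizeof(label_policy_map[0]); i++)
        printf("label ID %u -> policy bitmap 0x%llx\n", label_policy_map[i].label_id,
               (unsigned long long)label_policy_map[i].policy_bitmap);
    return 0;
}
```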
Let's say traffic is from pod 1. Pod 1's label is role=frontend, and its label ID is 1. By looking up the label policy map, the matching policy bitmap is 0. This means the traffic from pod 1 shall be denied, because none of the policies allow the traffic. For traffic coming from pod 2, pod 2's label ID is 4, and by looking up the label policy map, its value is 0011, which means two policies match, so the traffic can be allowed. This also explains why we give an ID to the label combination: when a pod sends out traffic, we put this ID into the packet, so the label ID travels along and reaches the destination. That's the reason for pod 2: it contains multiple labels, so we give the label combination an ID, and this single ID represents pod 2's label combination.

As mentioned earlier, when sending out the traffic, we put the label ID into the packet. We call this traffic instrumentation. We put the info into the Geneve headers. This picture shows how we add the label info into the packet: there are two fields, the integer value of the pod label and the integer value of the namespace label. When we write the label values into the Geneve headers, the format is fixed, so later, when the packet arrives at the destination, the label values can be retrieved. We add another XDP table, D, which is allowed labels, and now the expression changes to (D OR (A AND NOT B)) AND C. When the packet arrives at the destination, we retrieve the label ID from the packet, then we look up table D to see whether any policy allows such a label. The calculation between table D and table A is an OR relationship, which matches the policy definition: if the policy defines an IP block, we use A AND NOT B to evaluate it; if the policy defines a pod selector or a namespace selector, we use table D to evaluate it. The two evaluations are in an OR relationship, which means if either of them matches, we allow the traffic. Finally, the traffic still needs to pass the evaluation of table C, which is allowed ports.

With the label-based policy, we still get O(1) efficiency, and now there is no IP translation for the pod selector and the namespace selector. When the policy is expressed as labels, we don't translate the labels into IPs; we put the labels into the XDP table. When traffic comes, we check the traffic's label info, compare it with the XDP table, and then make the decision whether to allow or deny the traffic. There are some limitations. The obvious limitation is that it only works for cluster-internal traffic. When a pod sends traffic out, we need to add the label IDs into the packet; if the traffic comes from outside the cluster, there is no such traffic instrumentation, so we cannot judge the traffic by labels. Let's take a look at the policy example one more time. The IP block definition has nothing to do with labels, so we still use the IP-based implementation to judge that traffic. In a production environment, we recommend the cluster admin define IP blocks to control the cluster's external traffic and define pod selectors to control the cluster's internal traffic.
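Below is a minimal C sketch of this combined judgment. Extracting the label ID from the Geneve option and doing the actual map lookups are omitted; the function only shows the expression, and the example values mirror the pod 2 case above under the assumption that the port matches both policies. It is not the actual Mizar XDP program.

```c
#include <stdint.h>
#include <stdio.h>

/* Final judgment in the label-based design:
 *   (D OR (A AND NOT B)) AND C
 * D is the policy bitmap looked up in the "allowed labels" table using the
 * label ID carried in the packet (pod/namespace selector rules),
 * A and B are looked up by the source IP (IP block cidr/except rules),
 * and C is looked up by protocol and port.
 * A non-zero result lists the policies that allow the traffic. */
static uint64_t judge_labeled(uint64_t d, uint64_t a, uint64_t b, uint64_t c)
{
    return (d | (a & ~b)) & c;
}

int main(void)
{
    /* Illustrative only: traffic from pod 2 carries label ID 4, so the lookup
     * in table D returns 0b0011; it matches no IP block rule (A = 0, B = 0);
     * assume the protocol and port are allowed by both policies, so C = 0b0011. */
    uint64_t verdict = judge_labeled(0x3, 0x0, 0x0, 0x3);
    printf("verdict bitmap = 0x%llx (%s)\n",
           (unsigned long long)verdict, verdict ? "allow" : "deny");
    return 0;
}
```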
So I have just introduced our approach at a high abstraction level. There are some details that I think will be interesting. The first one is how we organize CIDRs. In a policy definition, CIDRs may overlap, and when we translate a pod label into IP addresses, we actually translate them into CIDRs, which may also overlap. In the XDP table, the CIDR is the key of the mapping table, so we need the CIDR keys to be unique; otherwise we couldn't achieve the O(1) solution. So we create a trie to hold the CIDRs. Each CIDR is at most 32 bits, so the trie height is 32. We input all the CIDRs into the trie one by one; they are automatically organized inside the trie, and then we extract unique CIDRs from the trie.

The second small topic is about how the network policy work is organized in our code. Our product is Mizar. It has a control plane and a data plane. The control plane watches the Kubernetes API server and generates data. The data plane updates the kernel XDP tables and uses them for networking purposes. For network policy, the control plane interprets the policy: it reads the policy definition and translates it into the data, which are the tables A, B, C, and D I showed. I have to warn you that the code there is quite complicated, because the network policy definition has several layers and covers quite a lot of scenarios. Although the code is complicated, its logic is clear: the input is the policy definition, combined with the cluster's running state, and the output is the data for the XDP tables.

The third small topic is about the policy bitmap. We know each bit can only indicate one policy. The bitmap value is an unsigned 64-bit integer, but that doesn't mean we can only handle 64 policies in the cluster; that would definitely not be enough. The good thing is that we don't have this limitation on the cluster; we have this limitation only per pod. Do you remember that in the policy definition there is a pod selector, which defines which pods the policy has effect on? So we actually generate the policy bitmap at the pod level, and the limitation is per pod. We think a limit of 64 policies per pod should be enough, and if it's not, we can change the data type to hold 128 policies or even more. So those are the three small topics I wanted to bring up.

OK, so now, any questions? What you described throughout the whole session, is that an open source project, this XDP table work? It's quite interesting. Is this something available for other people to try? Yeah, yeah, it's an open source project. It's here on the slide, yeah, this one, and our project name is there. Our project name is Mizar. All the code is there, open source, so you can take a look. Do you have any performance data comparing the traditional way versus the XDP way of handling network policy? For the performance data, I haven't generated it yet. So it's in theory that we call it an improvement from O(N) to O(1), but I haven't generated the performance data. OK, so that's my talk. Thank you.