Hello, everyone. Welcome to KubeCon 2020, and welcome to this session on multi-tenant networking for Kubernetes. My name is Ying Xiong, and I'm with Futurewei Technologies. In this talk, Sherif, who is now with Microsoft, and I will present a multi-tenant, scalable network solution for Kubernetes. We hope you will enjoy the talk, and by the end of the session hopefully you'll have learned something and be interested in contributing to this project or building your own network solutions. With that, let's get started.

This is the agenda for today's talk. I will give an introduction and some background on the project: why we are doing this, and what the context is. I will also talk about the current network model in Kubernetes and introduce the high-level design of our new multi-tenant model for Kubernetes. We have a forked version of Kubernetes called Arktos; you can think of Arktos as a multi-tenant Kubernetes, and I will introduce it in the next slides. After that, Sherif will explain how we implement the multi-tenant network model in Kubernetes and introduce the Mizar project. Mizar is a virtual network solution that has both a control plane and a data plane based on XDP technology.

The Mizar project is part of an umbrella project called Centaurus. Centaurus includes two independent projects: one is Arktos, and the other is Mizar. Mizar, again, is what we discuss in this talk. However, I'd like to briefly introduce the Arktos project first so that you understand some of the context. As I mentioned earlier, Arktos is a forked version of Kubernetes with major changes in design. There are three goals for the Arktos project. First, we want to unify VM and container orchestration in the runtime. For that, we extend the pod definition and make the runtime agent (the kubelet) a unified agent that supports both VMs and containers. Second, we want Kubernetes to support 50K to 100K nodes in a cluster. For that, we have to partition Kubernetes components such as the API server, schedulers, controllers, and etcd. Third, we want to build a true multi-tenant platform, so we designed a multi-tenancy solution for Kubernetes, including a new tenant object and a new network model. We believe that with those three goals, Arktos takes Kubernetes to the next level, making it a true cloud infrastructure platform.

Back to the main topic we have today, which is Mizar. The Mizar network solution tries to address the following problems. First, we want to provide a virtual network solution for Kubernetes, so that pods from different tenants reside in different virtual networks. Second, we want to address fast provisioning of network resources for pods: basically, how to get a pod ready as quickly as possible from a network perspective. Third, we want a scalable virtual network solution that supports networking within a cluster of more than 100K hosts. Those are the problems we try to solve with Mizar.

As most of you already know, the common network model for Kubernetes is a flat model: a single address space and a single shared DNS. By default, every pod or container can communicate with every other pod or container in the cluster, so by default there is no multi-tenancy from a network perspective. Kubernetes introduced network policies to isolate containers or pods from each other. However, network policy does not provide isolation as strong as a virtual network does.
For example, network policy does not prevent packet sniffing: someone sitting somewhere along the path the traffic passes can extract information or data out of the packets. Additionally, network policies are typically implemented on top of a Linux kernel feature called Netfilter, which uses iptables rules. In practice, the iptables rules can get huge, causing overhead and increasing network latency. That is not a security issue, but it is not the kind of solution we want.

So in Arktos, we introduce the network object, a new CRD object that represents a VPC or subnet. Each pod to be created has to be associated with a network object, and each network object has its own IP address space. Pods or containers created in different networks are therefore naturally isolated by the network boundary, as you can see from the diagram on this slide. Within each network, you can still use network policies to manage network security within a single tenant; a minimal example of such a policy is sketched a little further below.

The new network object we introduce is an abstraction of network resources, not an actual implementation. Someone still needs to actually create the VPCs, create the subnets, manage the IP address spaces, and route the network traffic. This is where Mizar comes into play: Mizar is one implementation of the multi-tenant network model in Kubernetes. So now I will hand it over to Sherif. He will present the details of Mizar's design and implementation. Thank you.

Hi everyone. This is Sherif. I led the development of the Mizar project, and I'm now a software engineer with Microsoft. We built Mizar from the ground up to accelerate pod network provisioning at scale, and really to rethink cloud networking altogether. We built it in the exact same way we build distributed systems in the cloud, to make the cloud network simple to understand and simpler to operate. In the rest of this talk, I will walk you through our thought process and how Mizar works.

At a high level, Mizar consists of CRD operators, a daemon, and a CNI plugin. The operators, the daemon, and the CNI are Mizar's management plane components. Mizar exposes a gRPC interface between the operator and the CNI, and we eliminate any API calls from the worker nodes to the API server. This prevents operator failures from being amplified as we add more workers to the cluster. On the data plane side, Mizar consists of a set of XDP programs that process the node's packets. I will detail exactly how the XDP programs process packets later. With this architecture, we rethink the data plane programming model to scale the management plane, accelerate pod provisioning, and develop customized logic for network services. As a result, Mizar enables scalable and multi-tenant Kubernetes networking.

Before I detail how Mizar works, I would like to discuss the limitations of flow-based network programming. Flow-based programming is the de facto programming model in virtual switches, including Open vSwitch. I will take OVN and OVS as an example. OVN uses the concept of logical ports to create a large logical switch that spans multiple hosts. With this model, creating 10,000 logical ports generates more than 40,000 port bindings. The logical switch approach does not scale as we increase the number of worker nodes in a cluster. Moreover, during flow programming, it's not uncommon to observe an increase in CPU utilization while flows are processed. With the logical switch architecture, the time to provision network resources for each new container depends on the number of containers that already exist in the system and the number of worker nodes in the cluster.
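To make the per-tenant policy idea mentioned above concrete, here is a minimal, standard Kubernetes NetworkPolicy sketch; the namespace, names, and labels are illustrative only:

```yaml
# Illustrative example: only pods labeled role=frontend in the same namespace
# may reach pods labeled app=backend on TCP port 80.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: tenant-a
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 80
```

Rules like this are typically enforced by the CNI, often via Netfilter/iptables as noted above, which is why large rule sets can add latency; they complement, rather than replace, the virtual-network isolation that Arktos and Mizar provide.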
Coming back to OVN: clearly, the logical switch approach restricts scale and is not suitable for dynamic cloud applications with short lifespans, such as serverless workloads.

With the limitations of the flow-programming model in mind, we redesigned host networking in Mizar to interconnect containers using only XDP programs. To do this, we attach an XDP program to each physical interface of a worker node. We named this program the transit XDP; it processes all ingress packets to the worker node. We also attach another XDP program to the veth peer connecting a container to the root namespace. We call this program the transit agent; it processes all egress traffic from each container. From a management plane perspective, all we needed to do next was expose the eBPF user-space API as gRPC interfaces. The operator programs the logical functions of the XDP programs through these RPC interfaces.

To understand the role of these logical functions, we need to look at the new network organization that Mizar enables in Kubernetes. We extended Kubernetes with two resources that you typically find in any multi-tenant cloud system: virtual private clouds (VPCs), and subnets within the VPCs. Creating VPCs and subnets is straightforward in Kubernetes with CRDs and operators. On the data plane, we introduced new logical functions within the XDP programs. The first logical function is the bouncer, scoped to a network (subnet), and the second is the divider, scoped to a VPC. Unlike logical routers or switches, the bouncers and dividers are in-network distributed hash tables, and I will detail exactly how they work in the next few slides. Bouncers and dividers are the logical functions that make up the VPCs, isolating pod traffic for multi-tenancy and allowing tenants to reuse the same network address space.

The user creates a VPC like any object in Kubernetes, as a YAML file, specifying the CIDR range of the VPC and the number of VPC dividers. Because the divider is a distributed hash table, the user can specify any number of dividers in the object definition, with one being the default. When the VPC operator receives the VPC object, it schedules the VPC divider on one of the worker nodes of the cluster. Scheduling the divider does not mean that the operator runs any new code on the node; all the operator does is label the selected host as a divider for the VPC. The operator also assigns a unique identifier to the VPC that the data plane uses to separate traffic, known as the virtual network identifier (VNI).

After creating the VPC, the user creates a subnet within the VPC, again specifying the number of bouncers to create for the network, with one being the default. When the network operator receives the subnet object, it schedules the bouncers on some of the worker nodes of the cluster. Scheduling a bouncer involves two actions. First, the operator labels the host as a bouncer on the management plane. Second, it programs the divider's worker node through RPC calls. The RPC call simply populates an eBPF map in the transit XDP program of the divider host with the networks within the VPC and the IP addresses of the hosts acting as bouncers for those networks. In this example, Net1 has Bouncer1 and Net2 has Bouncer2, and both are populated in the eBPF map of the divider host.
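To make this concrete before moving on to pod creation, here is a minimal sketch of a VPC, a subnet within it, and a pod annotated to join that subnet (pod annotations are discussed next). The API group, field names, and annotation keys are illustrative assumptions, not necessarily Mizar's exact schema:

```yaml
# Illustrative sketch only: group/version, field names, and annotation keys are assumed.
apiVersion: mizar.com/v1          # assumed CRD group/version
kind: Vpc
metadata:
  name: vpc-1
spec:
  ip: 10.0.0.0                    # CIDR range of the VPC
  prefix: 16
  dividers: 1                     # number of dividers (one is the default)
---
apiVersion: mizar.com/v1
kind: Subnet
metadata:
  name: net-1
spec:
  vpc: vpc-1
  ip: 10.0.0.0
  prefix: 24
  bouncers: 1                     # number of bouncers (one is the default)
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  annotations:
    mizar.com/vpc: vpc-1          # assumed annotation keys; the operator uses these
    mizar.com/subnet: net-1       # to place the pod within the VPC and subnet boundary
spec:
  containers:
    - name: app
      image: nginx
```

The operators watch these objects: the VPC operator labels a node as the divider and assigns the VNI, and the subnet operator labels nodes as bouncers and programs the divider over gRPC, exactly as described above.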
Now comes the interesting part, where the user creates a pod within a multi-tenant network, natively in Kubernetes and similar to what you would typically find in any cloud system. To do that, we use annotations on the pod object to specify the VPC and the subnet of the pod. A controller adds the network and NIC annotations that I'm showing on this slide, and the Mizar operator uses these annotations to provision the pod within the requested VPC and network boundary.

The Mizar operator provisions the network resources with a constant number of RPC calls, typically two. The number of RPC calls does not depend on the number of worker nodes in the cluster or the number of pods already provisioned; this is what allows network provisioning to scale. The operator makes one call to the bouncer host of the subnet; this call effectively adds an entry in an eBPF map with the IP address of the node hosting the pod. The other call goes to the node hosting the pod; it provisions the veth peer interface for the pod and makes it ready for the CNI to consume. Internally, when the CNI adds the network interface, it makes a local call to the Mizar daemon to consume that interface. This design significantly simplifies the CNI: when the CNI adds an interface, it only consumes an interface that has already been created by the Mizar daemon.

The effect of this provisioning workflow is significantly better scale and significantly better time to provision networking for the pod. The time to provision network resources for a pod is now constant, independent of the number of worker nodes in the cluster and even of the number of pods already provisioned in the cluster. Compare this to OVN, which does not scale well as we add more nodes to the cluster or as the number of existing pods increases.

Up to this point, I have described the management plane operations. In the next few slides, I will describe in detail how the XDP programs on the hosts process packets. Consider the case in which a pod with IP address 10.0.0.1 on host A sends a packet to a pod with IP address 10.0.0.2 on host C. An XDP program intercepts the outgoing packet from the pod when the veth peer receives it in the root namespace. The XDP program simply looks up a static configuration in an eBPF map and encapsulates the packet into a Geneve packet. It also sets the virtual network identifier of the VPC in the Geneve header. Several tenants can still use the same address space, because the network distinguishes the traffic of each VPC by the VNI field. The only information available to the transit agent at this stage is the IP address of the bouncer, so it sends the packet to the bouncer, on host B, by redirecting it for transmission on the physical interface.

When the bouncer receives the packet, the transit XDP program is the first to process it on the bouncer host. The XDP program looks up the inner destination address in an eBPF map, then rewrites the outer destination IP address to host C, which is the worker node running the destination pod 10.0.0.2. When the packet arrives at host C, the XDP program decapsulates it and redirects it to the veth peer of the pod.

This approach greatly simplifies pod provisioning, but it has a serious drawback: all packets now traverse one extra hop to reach their destination. I will now describe how we solve this entirely in XDP. To overcome the extra-hop problem, we modify the XDP program running on the bouncer host to respond to ARP queries. Since the pod's IP and MAC addresses are already configured on the bouncer by the Mizar operator when it provisions the pod's network, it makes sense to answer ARP queries at this stage.
When the pod at host A sends an ARP query, the bouncer responds with the MAC address of 10.0.0.2, but it does not only answer the ARP query. The bouncer also adds a Geneve option in the outer packet to tell the transit agent that 10.0.0.2 is hosted at host C. When host A receives the ARP reply, it extracts the Geneve option and adds the host mapping information to its eBPF map. Now the transit agent of 10.0.0.1 sends packets directly to the destination pod. This direct communication happens from the very first packet of the flow, and it remains in place throughout the pod's lifetime.

There is one more detail here. When the transit agent sends a packet directly to the destination pod, it sets one bit in a Geneve option to tell the destination host that the packet was sent directly from the source pod's worker node and not from the bouncer. This allows the transit XDP at host C to also return packets directly to the source pod. This simple mechanism allows all flows in the cluster to be direct, without traversing the bouncer, and at the same time allows the management plane to provision the network by making only a few RPC calls to a couple of hosts in the cluster, not all the hosts in the cluster.

If you think about the role of the bouncer now, compared to a logical switch or a logical router in OVN, it is an in-network controller rather than a virtual switch. It's like a microservice in the network that provides distributed functions to the endpoints. We took this observation and extended Mizar's functionality beyond providing simple connectivity between pods. Essentially, we extended the bouncer functionality to implement Kubernetes services as well.

This is best explained by an example. Consider the 10.0.0.1 pod sending packets to a 192.168.0.x service IP. The transit agent XDP program processes the packet first; it knows nothing about the destination except that it should send the packet to the bouncer at host B. When the bouncer receives the packet, it looks up the destination IP address of the inner packet and finds that it is a service IP. There are several decisions the bouncer could make at this point, including rewriting the inner destination IP address and sending the packet to a backend pod, like any conventional NAT or load-balancer device. But I will describe a different approach. Instead, the bouncer adds a Geneve option describing its decision: it tells the pod's transit agent how it should modify the inner packet for this service. In the example shown, the modification option says to rewrite the service IP address to 10.0.0.4 and the port to 80. Then the bouncer returns the packet to the sender host, which is host A. The transit XDP program on the client's host stores this modification option in an eBPF map entry, rewrites the packet accordingly, and resends it, this time not to the service IP but to the backend pod 10.0.0.4. From this point forward, the transit agent sends all packets for this flow to the backend pod, again without going through the bouncers, iptables, or any other intermediate step.

This powerful concept, enabled by XDP and Kubernetes operators, allows us to scale services and replace kube-proxy without compromising the advantage of direct communication between pods over service IPs. Mizar scales out the number of bouncers and dividers in the network so that they become a distributed in-network controller that serves any traffic. Mizar also implements a load-balancing function on the outer IP header to rebalance traffic across the bouncers.
But we typically find that a single bouncer is enough in most cases, as it only processes ARP queries and the first packets of flows. I have given an overview of Mizar, and there is a lot more to the project. With that, I conclude this talk; we will now play a recorded demo before moving to the Q&A. Thank you very much.

Hi. For this demo today we have a three-node Kubernetes cluster created with kind, with Mizar installed. Mizar is installed on this cluster via a DaemonSet and an operator deployment. We bootstrap the cluster with a default VPC and network, each with its own divider and bouncer. Here we use Mizar's simple endpoints to deploy pods. On each node we see that an XDP program is loaded on the main interface, and on the veth device in the root namespace we load the agent XDP program. Next, we demonstrate ping between the two recently created pods.

Next, we create another VPC with two subnets. The VPC has two dividers, and here each of its subnets will have a single bouncer. For each of these subnets we create an endpoint, or a pod. Now, with these two recently created pods, we demonstrate cross-network ping. And here, to demonstrate isolation, we try to ping across VPCs.

Next, we demonstrate our operator provisioning 40 endpoints. Regardless of the number of endpoints or pods already on the system, all subsequent endpoints are provisioned in a constant time of about 0.35 seconds.

In the next section we demonstrate intra- and inter-network direct paths. For the intra-network direct path, only the first ARP packet goes through the bouncer. Once both sides have cached the endpoint host information, any traffic thereafter flows only between the two endpoint hosts. For the inter-network direct path, the first ARP packet goes through the divider and both bouncers. Here the divider and the two endpoint hosts must cache the host information. Once all three have cached the endpoint host information, any traffic thereafter flows between the two endpoint hosts without the divider as an intermediary.

Finally, in this part of the demo we demonstrate using Mizar's scaled endpoint as a replacement for the Kubernetes ClusterIP service. When a Service is created, Mizar creates a corresponding scaled endpoint. Here we label the two recently created pods to add them as backends for the scaled endpoint. For this demonstration we curl the service from our third pod, and in the reply we see that the service responds with the pod name of one of the backends. Both pod 1 and pod 2 reply to curl. We can also ping the service; this is possible because of the current scaled endpoint implementation.
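For reference, the service used in this last part of the demo can be an ordinary Kubernetes ClusterIP Service; Mizar watches Service objects and creates the corresponding scaled endpoint behind the scenes. A minimal sketch, with illustrative names and labels:

```yaml
# Illustrative sketch: a standard ClusterIP Service. Pods labeled app=demo-backend
# become its backends; Mizar creates a scaled endpoint for the Service automatically.
apiVersion: v1
kind: Service
metadata:
  name: demo-service
spec:
  type: ClusterIP
  selector:
    app: demo-backend
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```

Curling the service IP from another pod then returns replies from either backend, as shown in the demo, with the bouncer steering only the first packet of each flow before traffic goes directly to the chosen backend.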