Okay, I guess we can go ahead and start. I hope you had a very good lunch. Now that everyone is nice and full, our job is going to be to keep you entertained and awake for the next 40 minutes. It's going to be tough, but we'll try. My name is Ton Ngo; I'm with the IBM Silicon Valley Lab in California. My colleague here is Baohua Yang from the China Research Lab in Beijing, and this is Mohammad Banikazemi from the Watson Research Lab in New York. So today we'll talk about networking for containers. There's been a lot of work in this area recently, but for our talk we're going to pick one particular use case, dive deeply into it, and understand what's going on under the hood. I'll start by giving some background on containers and the state of container networking in OpenStack right now. We're going to pick a Kubernetes cluster as our use case and understand the inner workings of Kubernetes in OpenStack. Then Baohua will take you through the performance evaluation and what we observed from Kubernetes running on OpenStack. And then Mohammad will talk about the opportunities that we see there and the future direction, and we'll finish with Q&A.

All right, so containers. A container is a very useful abstraction for a process. When it comes to networking, the initial approach was to just treat it as a process running on the host, so your container would take the host IP and get assigned a port. That works, but as containers became more popular and prevalent, we found that it's pretty cumbersome to manage ports; it's a lot easier to manage IPs instead. And this is something an overlay network gives you: it allows you to manage your own IPs, and it isolates you from the underlying network infrastructure.

Now, in the OpenStack context, Magnum is a new project; it's containers as a service. There have also been many talks on Magnum, so I won't go into detail on Magnum here. For our case, we're going to use Magnum to build the Kubernetes cluster for our study. The main thing about Magnum is that, besides just deploying and managing your cluster, it really integrates the container side into the fabric of OpenStack. For instance, it uses Nova instances for hosting, it uses Cinder for block storage, and, most importantly for us here, it integrates the containers into Neutron.

The goal for our study is to pick, like I said, one particular case and dive deeply into it. We use Magnum to build a Kubernetes cluster for us, then we go in and understand what's happening in the Kubernetes layer, and we do a performance evaluation on that. Hopefully from that we can figure out what we should work on to make improvements.

So if you take Magnum today and deploy a Kubernetes cluster, this is what you get in OpenStack. At the top you see three Nova instances; our cluster size here is three. One of them is the Kubernetes master; the other two are the Kubernetes nodes that are going to host your containers. These three instances each have an eth0 interface, a Neutron port connected to a private subnet on a private network; these are Neutron networks. The private subnet is connected to the public subnet through a router, and we have a couple of floating IPs that allow us to log into the hosts in the Kubernetes cluster. We also have a number of load balancers.
One is for the Kubernetes API server, so that you can have multiple Kubernetes masters. Another is for etcd, which is used by Flannel. And if you deploy services in Kubernetes that require a load balancer, then each of them would have a load balancer created as well. So this is basically what we get.

Now, before we go deeper into what Kubernetes does in terms of networking, it's helpful to understand some of the basic concepts, so I'll describe them briefly here. There are basically three abstractions in Kubernetes. The first one is a pod. A pod is basically a group of containers that run on the same host. A pod has an IP, and that means the group of containers in the same pod talk to each other using localhost. If a container needs to talk to a different pod elsewhere, it has to use the IP address; that's the key here. The second abstraction is a service. A service is basically a proxy for a pod; that's all it is. A service has an IP address associated with it. The reason for the service proxy is that a pod is not stable: it could die and be recreated elsewhere, and with that, the IP of the pod can change. So you don't want to use the pod's IP to talk to it; instead, you use the service to talk to your pod. The third abstraction is the replication controller. What it does is watch a pod and maintain an exact number of replicas of that pod. For our case here, we are mainly concerned with pods and services, because they are the ones that require networking.

To support this kind of networking structure, Kubernetes uses three components. The first one is the kube-proxy; this is a Kubernetes component that runs on every node. The second is Flannel. There are several options for providing networking in a Kubernetes cluster, but Flannel is a common choice, and basically what it is is an overlay network. The third thing that Kubernetes manages is a set of iptables rules, and these run in the kernel. We will take a closer look at these components.

All right, so now if we put all those together, the OpenStack structure and the Kubernetes cluster structure, this is what you get. Here you have two nodes, connected together on the private network through eth0; you have the Flannel overlay; and then you have the components there to support it, the kube-proxy and iptables. Here I show two pods, one on each host, and within each pod is a set of containers. You can see that each pod has an IP, and the pod on the right side has a service associated with it. So if a container in the pod on the left needs to talk to a container in the pod on the right, it goes through the service. So that's how the Kubernetes cluster operates.
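To make the pod and service abstractions above concrete, here is a minimal sketch of what the corresponding Kubernetes v1 API objects might look like, expressed as Python dictionaries. The names, labels, images, and ports are hypothetical examples, not taken from the cluster in the talk:

```python
# Minimal sketch of Kubernetes v1 API objects for a pod and the
# service that proxies it. Names, labels, and ports are illustrative.

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web", "labels": {"app": "web"}},
    "spec": {
        # All containers in this pod share one network namespace,
        # so they reach each other over localhost.
        "containers": [
            {"name": "frontend", "image": "nginx",
             "ports": [{"containerPort": 80}]},
        ]
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web-svc"},
    "spec": {
        # The selector is how the service finds the pods it proxies;
        # the service gets its own stable IP (e.g. from 10.254.0.0/16),
        # which clients use instead of the unstable pod IP.
        "selector": {"app": "web"},
        "ports": [{"port": 80, "targetPort": 80}],
    },
}
```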
All right, so next we'll take a deep look at what's happening under the covers when networking happens in Kubernetes. First I'll walk you through the setting up of the networking. Here I'm showing a Nova instance coming up for a host; it has an IP address of 10.100.30.68, and it comes up with just the eth0 interface to Neutron. Outside there's an etcd server, and this is running on the Kubernetes master.

As the host comes up, the first thing that happens is that it starts the Flannel service. What Flannel does is go through a little protocol with etcd to basically allocate itself a subnet to use for this host. Here we can see that it has obtained for itself the subnet 10.100.5.0/24. Next, Flannel creates a flannel0 tunnel interface on the node, and then it adds a rule to iptables, a masquerade rule in the POSTROUTING chain. Basically, this routes all the overlay traffic to the flannel0 interface. So that takes care of the Flannel overlay. Next we have the kube-proxy, which is just a Kubernetes process that starts up. And then when Docker starts up, it creates for itself a Docker bridge, called docker0, and the bridge gets an IP address from the Flannel overlay network. So that provides the basic structure.

Now, when you create a pod in Kubernetes, you get a set of containers for the pod. Each of those containers has an eth0 interface connected to a veth interface on the Docker bridge. So that takes care of your pod and containers; the pod's interface gets its IP address from the Flannel overlay.

Finally, when you create a service to be a proxy for your pod, Kubernetes does a number of things. First it allocates an IP address for that service; here we see that it gets the address 10.254.10.54. Next, the kube-proxy allocates a port, here port 42140, and starts listening on that port. Then it adds two rules to iptables, one in the OUTPUT chain and one in the PREROUTING chain. These rules match on the IP address of the service, and the net effect is that all traffic targeting this IP address gets routed to that port on the kube-proxy. So everything goes to the kube-proxy, and the kube-proxy knows that this IP address represents this set of pods and knows how to forward the message. So that's the structure.
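A hedged sketch of the iptables rules just described, as they might be installed from Python. The chain names follow the talk; the exact rule options are illustrative rather than the literal rules Flannel and the userspace kube-proxy installed, and the service's TCP port is assumed to be 80:

```python
import subprocess

# Illustrative rules for the setup described above. Addresses come
# from the talk: Flannel network 10.100.0.0/16, service IP
# 10.254.10.54, kube-proxy listening on port 42140.

def iptables(*args):
    # All of these rules live in the NAT table.
    subprocess.run(["iptables", "-t", "nat", *args], check=True)

# Flannel: masquerade overlay traffic in the POSTROUTING chain so it
# leaves via the flannel0 tunnel interface.
iptables("-A", "POSTROUTING", "-s", "10.100.0.0/16",
         "!", "-o", "docker0", "-j", "MASQUERADE")

# kube-proxy: redirect traffic aimed at the service IP to the local
# kube-proxy port, once for locally generated packets (OUTPUT) and
# once for packets arriving from the bridge (PREROUTING).
for chain in ("OUTPUT", "PREROUTING"):
    iptables("-A", chain, "-d", "10.254.10.54/32", "-p", "tcp",
             "--dport", "80", "-j", "REDIRECT", "--to-ports", "42140")
```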
So now let's take a look at what happens when a message gets sent. Here I'm showing two Nova instances, one at the top and one at the bottom, each with a pod on the left side. The pod at the top has IP address 10.100.5.3 and the pod at the bottom has IP address 10.100.70.2, so both are on the Flannel network. As you remember, two containers can't talk to each other's pod IP address directly, so we have to have a service. Here we have a service, the purple box, with IP address 10.254.10.54, and it serves as the proxy for the pod at the bottom.

So suppose the pod at the top wants to talk to the pod at the bottom; let's see what happens. The first message that comes out of the container in the top pod goes to the docker0 bridge. If you look at the packet, it has the source IP address of that pod and the destination IP address of the service; that's what we expect. As we remember from the last slide, we have the two iptables rules; what they do is capture that packet and reroute it to the kube-proxy on that port, 42140. The kube-proxy gets it, and it knows from the records within Kubernetes that this particular service is a proxy for the pod at the bottom; it has that mapping. So what it does next is change the header and route the traffic to the correct IP address for that pod. Now you see it changed both fields: it changed the source IP, and it changed the destination IP to 10.100.70.2. Then the masquerade rule that we saw on the last slide captures that and routes it to the flannel0 tunnel interface, which takes you to the Flannel daemon.

Flannel takes a look at that destination IP address, 10.100.70.2, and it knows that it maps to the Nova instance with the IP address 10.100.30.67. So Flannel encapsulates the message and gives it the IP addresses of the two Nova instances: now we see the source is 10.100.30.68, the Nova instance on the top, and the destination is the Nova instance at the bottom. After that we end up in Neutron land, and Neutron does its thing; we'll take a closer look next. Neutron does its job and delivers the message to the eth0 interface on the host at the bottom. That gets passed to Flannel, and Flannel decapsulates the message; you get back the inner message, which gets routed to the right container. So that's the long story of how a message traverses the whole networking chain. Here the purple message at the top is the original message, the blue messages are the proxied messages that Kubernetes implements, the green messages are the overlay messages, and the red part is Neutron.

Now, that's just what happens on the Nova instances; once you leave the Nova instance, you end up on the Neutron side. This is a picture that I copied from the Neutron networking guide, so that we can get an idea of what happens there. Here we are using the ML2 driver for OVS, so once the message leaves the Nova instance, it ends up in a Linux bridge, where the security group rules are applied, and then it ends up in the integration bridge. From this point on, it depends on your kind of network. Suppose we use a VLAN network for our Kubernetes cluster; then it takes the path at the bottom: it goes to the OVS VLAN bridge, and from there it goes out to the physical VLAN. Now, if we had used a VXLAN or GRE network for our Kubernetes cluster, it takes the path at the top: it ends up in the OVS tunnel bridge, and from there another encapsulation happens for VXLAN or GRE. So you can see that in this case we actually have double encapsulation: what happens at the Flannel level and what happens at the GRE or VXLAN level.

So that's what's happening under the hood. We can see that there's a lot of complexity here; it works, but there's a cost for all this complexity, and it buys you flexibility. So what we did next was to measure different paths through this whole scenario and understand the cost to performance.
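To summarize the address rewriting just described, here is a small Python sketch that traces a packet's headers through each stage, using the addresses from the talk. It is an illustration of the flow, not an implementation of kube-proxy or Flannel; the exact rewritten source address after the proxy hop is shown as the node IP for illustration (the talk only says both fields change):

```python
# Toy trace of the header transformations described above.

POD_TOP = "10.100.5.3"        # source pod, on the Flannel overlay
POD_BOTTOM = "10.100.70.2"    # destination pod, behind the service
SERVICE_IP = "10.254.10.54"   # stable service IP
NODE_TOP = "10.100.30.68"     # Nova instance hosting the source pod
NODE_BOTTOM = "10.100.30.67"  # Nova instance hosting the destination pod

def show(stage, src, dst, note=""):
    print(f"{stage:<22} src={src:<15} dst={dst:<15} {note}")

# 1. Container sends to the service IP (the purple message).
show("container -> docker0", POD_TOP, SERVICE_IP)

# 2. iptables redirects to kube-proxy, which rewrites both header
#    fields toward the real pod (the blue message).
show("kube-proxy rewrite", NODE_TOP, POD_BOTTOM, "both fields rewritten")

# 3. Flannel encapsulates; the outer header carries the Nova instance
#    IPs (the green message). Neutron then adds its own VLAN tag or
#    VXLAN/GRE layer on the wire (the red part, not shown here).
show("flannel0 encap", NODE_TOP, NODE_BOTTOM, "outer header, inner intact")

# 4. On the far side, Flannel decapsulates and the bridge delivers
#    the inner packet to the destination pod.
show("decap -> pod", NODE_TOP, POD_BOTTOM, "inner header restored")
```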
So with that, let me pass on to my colleague Baohua; he will take you through the performance observations.

So, this is Baohua Yang from the IBM China Research Lab. I will introduce the performance and observation part of our work. While attending this OpenStack Summit I noticed there are lots of container sessions, and mostly the people there are discussing networking problems, so it is a good time for the networking folks to listen to the application developers. Here I will show some quantitative measurements with real data.

This is a quick look at our test environment. It's a simple but typical OpenStack deployment: we have three nodes, one controller node and two compute nodes, plus two 10-gigabit networks. All the nodes are IBM X-series servers with two CPUs, 10 cores each, 256 GB of memory, and local disks. Inside each compute node we have several VMs, of two kinds: one is the Kubernetes master, and the others are Kubernetes slave nodes. Inside the Kubernetes slave nodes we run the containers; nothing special.

The scenarios we consider here cover three kinds of traffic paths across the Neutron implementations: server to server, VM to VM, and container to container. For the container-to-container scenario we have three sub-cases: containers inside the same pod, containers in different pods on the same host, and containers in different pods on different hosts.

Here I want to ask a question: how many people in this room are networking people? Please raise your hand. OK, there are a few. If you are not a networking person, please remember these two concepts: throughput and latency. This is really important, because if you know these two concepts, you are almost a networking expert.

OK, so the first chart I show here is the VM-level performance. At the VM level we utilize Neutron for the overlay, so there is a single layer of overlay. If you look at the chart, we actually have four cases: server to server, VM with flat networking, VM with VLAN, and VM with VXLAN. Comparing server-to-server with VM-flat, the bandwidth drops by nearly 18% while the latency increases to over three times. Comparing VM-flat with VM-VLAN, the only difference is the tag: the bandwidth doesn't change much, and the latency increases about 10%. Comparing VM-flat with the VM overlay, the difference is that we introduce a single overlay layer: the bandwidth drops by a lot, nearly 26%, while the latency doesn't increase that much. I want to emphasize what we can see from this chart: the overlay kills throughput, while the virtualization layer kills latency. This is very important.

So what is the performance bottleneck for the single-overlay case? It is actually the packet processing capacity. Here we show VM-to-VM data using VXLAN: we change the MTU, the maximum transfer unit, from 1450 to 1000. That means we generate more packets, which eats up CPU processing capacity, and the bandwidth drops sharply, to about half. That's why hardware offloading is widely adopted on physical hosts. To optimize this case there are also other techniques, for example jumbo frames, but you should take care when using them; for example, you should consider the fragmentation problem.
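The MTU effect just described is easy to see with a little arithmetic. Here is a small Python sketch of the packet rates a host must sustain at a given line rate for different MTUs; the 10 Gbit/s figure matches the testbed, and the point is the fixed per-packet processing cost:

```python
# Back-of-the-envelope sketch of why a smaller MTU hurts an overlay:
# at a fixed line rate, smaller packets mean proportionally more
# packets per second, and each packet pays a roughly fixed cost
# (header processing, encapsulation, per-packet CPU work).

LINE_RATE_BPS = 10e9  # the testbed NICs are 10 Gbit/s

for mtu in (1450, 1000):
    pkts_per_sec = LINE_RATE_BPS / 8 / mtu
    print(f"MTU {mtu}: ~{pkts_per_sec / 1e6:.2f} Mpps to fill the link")

# MTU 1450: ~0.86 Mpps; MTU 1000: ~1.25 Mpps. That is roughly 45%
# more packets for the CPU (or NIC offload engine) to process, which
# is consistent with the sharp bandwidth drop measured without
# hardware offloading.
```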
It seems that with only a single overlay, the performance already drops a lot. So, is it really that bad? This is a picture from our previous results in Wacoa. It shows that with a single flow, the bandwidth utilization is poor; however, when we increase the number of flows, the bandwidth utilization rises to nearly full; and with too many flows, the number drops down again. So the answer to my previous question could be yes, right? I think the real answer is that we should consider the workload. Does our application actually need a single high-throughput flow? For some cases, like NFV, the answer is yes. But if we consider cloud computing, if we consider IoT, the workload is naturally multi-tenant and naturally multi-flow, so the answer is no.

The previous results only focused on the VM layer; now we drop down to the container level. When we consider the container level, the scenario becomes complicated, because existing container stacks such as Kubernetes already implement their own overlay techniques. So if we just directly put Kubernetes on top of OpenStack, what happens? There is a double overlay. If you're a networking person, you know that for the analogous scenario with VLANs there is the QinQ standard to deal with double tagging, but for double overlays there's no good solution right now. We also compared the throughput drop and the latency increase: the throughput drop here is over 40% and the latency increase is over 30%. That doesn't look good; however, compared with the single-overlay overhead, which was over 17% for throughput and over 200% for latency, this number is not that bad, right?

To optimize the performance there are many possible angles. What I want to show here is simple but very important: the networking backend matters. Here we compare the Flannel implementation with different networking backends, using UDP and VXLAN encapsulation; these two are the backends recommended by Flannel, and both are very popular encapsulation techniques. The VXLAN backend obtains 3 to 5 times the throughput; the number looks wonderful. We also checked the additional cost introduced by the Kubernetes and Flannel components, such as the iptables rules and the kube-proxy process, and the throughput and latency change only a little.

So, what if there is no overlay, no VLAN tagging, and no virtualization, just pure container-to-container performance? Here is the result: we test two pods on the same host, connected directly by a single Linux bridge, and the number still doesn't look that good: over 30% throughput drop, while the latency increases by over half.
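For reference, switching the Flannel backend compared above was a one-key configuration in etcd. A minimal sketch in Python, assuming the etcd v2 CLI (etcdctl) and the stock Flannel config key; the overlay CIDR matches the 10.100.0.0/16 network from the talk:

```python
import json
import subprocess

# Minimal sketch of selecting the Flannel backend. Flannel reads its
# network config from a single etcd key; changing "Type" between
# "udp" and "vxlan" switches the encapsulation used for the overlay.

config = {
    "Network": "10.100.0.0/16",
    "Backend": {"Type": "vxlan"},  # or {"Type": "udp"}
}

# Write it with the etcd v2 CLI, as used by Flannel in this era.
subprocess.run(
    ["etcdctl", "set", "/coreos.com/network/config", json.dumps(config)],
    check=True,
)
```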
Now I'm going to talk about a few things that are happening to improve networking for containers, in the context of OpenStack and beyond. Some of them you may have already heard about from previous sessions, but I'm also going to briefly cover a couple of relatively new efforts. As you may be aware, Magnum is going through a phase of defining its container networking model and trying to provide a more generic networking system, where you can have different types of networking for your containers within Magnum. At the same time, there has been a significant amount of work on the Docker side; I will briefly talk about libnetwork and what it brings into the picture. I'm also going to talk about an OpenStack project that got started a few months ago, called Kuryr, where Docker containers are connected to Neutron networks. And finally, I will talk about one effort happening within the Neutron community to help networking for containers.

As you may know, Docker ended up becoming a more modular piece of software, and one of the main pieces that got its own module, separated from the core Docker engine, was the networking module. libnetwork was created and introduced in Docker 1.7; it stayed as an experimental branch for a couple of cycles, but with the upcoming release, 1.9, it will be part of the Docker release. It implements what they call the Container Network Model, not to be confused with the similar term in Magnum. It has a few simple concepts, not very different from what we have in Neutron. There is the notion of a sandbox, which contains the configuration for a container's networking stack; networks, which are collections of endpoints that can communicate with each other; and endpoints, which are essentially what connect containers, or sandboxes, to those networks. These are pretty similar to the concepts we have in Neutron.

In addition to becoming a separate module in Docker, the main improvement we are seeing is that libnetwork is extensible: it has a simple but powerful pluggable architecture. It comes with a few drivers, namely the null, host, bridge, and overlay drivers. The bridge driver is essentially the traditional Docker networking; the code has been redone, but it provides what you used to get by default from Docker. And there is a new driver that does multi-host networking in Docker through the use of an overlay network. In addition to these four drivers, there is a remote driver that acts as a proxy to external Docker network plugins. So in the bigger picture of plugins in Docker, you can have network plugins, and they implement the networking API behind the remote driver. The remote driver uses simple JSON-RPC to talk to a Docker network plugin, and that plugin can realize the networking needs in any way possible.

One possible solution is utilizing OpenStack Neutron, and that's where project Kuryr comes into the picture. Kuryr is essentially a Docker network plugin that uses Neutron to implement networking for Docker containers. It gets utilized through the remote driver of libnetwork, and the plan is to use the Kolla project to provide containerized images of common Neutron plugins for ease of use. The project is very new; it started just mid-cycle, the Liberty mid-cycle, and we hope to have a first release by the end of Mitaka. There is a lot to be done, but we've just gotten started. It's part of the OpenStack ecosystem: it uses Keystone for authentication, Neutron for networking, and to the extent possible it uses other OpenStack pieces, whether Oslo or the Neutron client and so on. It essentially is a simple plugin that maps Docker networks to, what else, Neutron networks, and Docker endpoints to Neutron ports. And with the latest release, Docker also provides IP address management; there is a pluggable IPAM driver there that can be the equivalent of subnets in Neutron. Similar to how Nova plugs and unplugs virtual interfaces into VMs and the network, Docker has join and leave, and for different types of virtual interfaces that needs to be done by different pieces of code, similar to what we do in Nova and, to some extent, in Neutron for certain services.
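A hedged sketch of what a remote-driver plugin in the spirit of Kuryr looks like from the plugin side: libnetwork POSTs JSON to well-known endpoints, and the plugin answers with JSON. The skeleton below only acknowledges each call; a real plugin like Kuryr would call Neutron at the marked points, and real plugins register through a spec file or Unix socket rather than the arbitrary TCP port used here:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Skeleton of a libnetwork remote-driver plugin. The paths below are
# the libnetwork remote API endpoints; the empty replies are just
# acknowledgements, so this sketch does no actual networking.

class PluginHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")

        if self.path == "/Plugin.Activate":
            # Tell Docker which plugin APIs we implement.
            reply = {"Implements": ["NetworkDriver"]}
        elif self.path == "/NetworkDriver.CreateNetwork":
            # Kuryr would create a Neutron network for
            # request["NetworkID"] here.
            reply = {}
        elif self.path == "/NetworkDriver.CreateEndpoint":
            # Kuryr would create a Neutron port for
            # request["EndpointID"] here.
            reply = {}
        elif self.path in ("/NetworkDriver.Join", "/NetworkDriver.Leave"):
            # Join/Leave plug or unplug the container's interface,
            # analogous to Nova plugging a vNIC.
            reply = {}
        else:
            reply = {}

        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 9999), PluginHandler).serve_forever()
```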
So with that, I just want to mention another effort within the Neutron community, VLAN-aware VMs, which was started last cycle; the spec got approved and the work is being carried out. It turns out that this is a solution that is very useful for nested configurations where you have containers within VMs, something that we have in Magnum; there is interest in such a nested architecture for various reasons, mainly security. In order to avoid having overlays on top of overlays, one possible solution is the proposal for VLAN-aware VMs. There are two types of ports: a new type of port called a trunk port gets defined as a new resource in Neutron, and other ports carry information about VLAN IDs that get utilized to distinguish traffic originating from different containers within a VM. There is a parent-child relationship between these ports, so each sub-port, a regular Neutron port, can belong to a different network, and the VLAN tags used here are just to distinguish the traffic within the VM; a rough sketch follows at the end of this transcript. The initial patches are out for review, and there has been renewed interest in this subject. The original proposal wasn't mainly targeting containers within VMs, but it turns out to be the perfect solution for supporting them, so there are multiple communities, within Neutron, Magnum, and Open vSwitch/OVN, that are interested in this effort. Hopefully we will see it move forward with wider participation from the Neutron community.

With that, I want to conclude our talk by saying that there is a lot happening on the networking side for containers. We are just getting started, but as things are moving really fast, we are trying to catch up with all the developments happening in the container communities, in particular Docker. We are just at the beginning of this journey; it is going to be a fast-paced move, and we have a lot of areas to cover: supporting nested architectures and supporting different kinds of higher-level container services, whether Docker Swarm or Kubernetes or Mesos. So this is work in progress, but it's an exciting time to work on the networking side of containers, and please feel free to join the efforts, either in Neutron, in Kuryr, or in Magnum. With that, I think we have a few minutes left if there are any questions.

[Audience question, inaudible] No, that's part of Neutron. Project Kuryr hopes to use it, and in doing so we hope to contribute to pushing it forward.

[Audience question, inaudible] Yeah, but the changes are required within Neutron. Hopefully we get Bob's help as well.

Thank you very much.
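As referenced above, a rough sketch of the trunk-port model for containers in VMs, written as illustrative API payloads in Python. At the time of this talk the Neutron patches were still under review, so the field names and IDs here are hypothetical and only mirror the parent/sub-port relationship described:

```python
# Hedged sketch of VLAN-aware VMs: a parent (trunk) port bound to the
# VM's vNIC, with sub-ports for the containers inside the VM. IDs are
# placeholders; this is a model of the proposal, not a working client.

# The VM's vNIC is bound to the parent port of the trunk.
parent_port = {
    "port": {
        "name": "vm-trunk-parent",
        "network_id": "NET_VM",  # placeholder network ID
    }
}

# Each container inside the VM gets its own sub-port, which can
# belong to a different Neutron network. The VLAN tag only
# distinguishes traffic inside the VM, avoiding a second overlay
# on top of the container overlay.
sub_ports = [
    {"port_id": "PORT_CONTAINER_A",
     "segmentation_type": "vlan", "segmentation_id": 101},
    {"port_id": "PORT_CONTAINER_B",
     "segmentation_type": "vlan", "segmentation_id": 102},
]

trunk = {
    "trunk": {
        "name": "containers-in-vm",
        "port_id": "PARENT_PORT_ID",  # parent-child relationship
        "sub_ports": sub_ports,
    }
}
```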