Microservices have emerged as the desired architecture for modern applications. They're flexible because you can divide your monolithic application into smaller modules or components and design, develop, and deploy those components independently. They're scalable because you can run as many instances of these components as you want in a cluster. But this flexibility and scalability come at the cost of deployment complexity. Imagine that you have these components with their dependencies, and they happen to run on the same host; those dependencies might conflict with each other. The answer is to run everything in containers. A container is this nice construct where you pack your component together with its dependencies and deploy them in isolation, in such a way that even if two containers land on the same host, they will not conflict with each other.

But I'm sure those who have played enough with containers have realized that containers by themselves are not enough, especially when you're planning to deploy them in production. First, they are mortal, so they die; you need a system that can monitor them continuously and reschedule them when they die. Second, each container consumes a certain amount of system resources, such as CPU, memory, and disk, so you need a system that can do resource management for these containers. And finally, the services running in these containers need to be able to find and talk to each other, so you need some kind of service management.

This is where DC/OS comes in. DC/OS is a container orchestration platform with Mesos at its core, and together with frameworks such as Marathon it can do container scheduling and resource management. It also provides service management through service discovery and load balancing, which is the topic of my talk today, and we'll go deep into that.

But before we look at the specific DC/OS networking stack, we should understand the challenges in container networking. Say you have a container orchestration layer running with a bunch of containers. The very first challenge is to provide connectivity to these containers, and these containers are different because they are not hosts and they are not VMs. Second, once you have solved connectivity, you want the services running in these containers to be able to talk to each other, so you need service discovery. And as I said, these containers keep dying and getting rescheduled on different hosts, so the service discovery mechanism must be able to reflect those changes in the cluster. Finally, you want multiple instances of your container sitting behind a load balancer, and the load balancer has exactly the same challenge as service discovery: it needs to track containers dying and coming up in the cluster in a dynamic fashion.

That brings us to today's talk. We will go through each of these topics in much more detail, but before we go there, I want to give you a high-level view of how the different components in the DC/OS networking stack fit together to complete container networking. Imagine that you have a master and a bunch of Mesos agents. You can either use the Docker runtime to launch Docker containers, or you can use something called the Universal Container Runtime (UCR), which can run both Docker containers and Mesos containers.
UCR has native support for something called CNI, the Container Network Interface, and we'll see what it is in a coming slide. UCR with CNI makes plugging in a third-party network really easy. Docker, on the other hand, uses something called CNM, the Container Network Model. So this layer provides the connectivity. Service discovery is done through the networking components Spartan and Mesos-DNS. Spartan is a component that runs on all the agents as well as the masters in a distributed fashion; we will see the benefit it gains from being distributed, and the instances gossip with each other to build the global state of the cluster. Similarly, load balancing is achieved through a component called Minuteman. Like Spartan, it also runs on all the agents as well as the masters, and the instances gossip to build the complete global state of the cluster. Keep this picture in mind while we focus on each individual component; it will give you context for where each component fits in the entire picture.

So let's start our journey. The first piece is the Container Network Interface. It was proposed by CoreOS and has since been adopted by the CNCF. UCR, as I said earlier, has native support for CNI. The way it works is that there is a network/cni isolator in Mesos which is responsible for creating the network namespace, and it hands that network namespace over to a CNI plugin. The CNI plugin is responsible for connecting the container to the host network, or to any network. Each CNI plugin comes with a configuration; or you could say each virtual network created in container networking comes with a configuration, which is just a JSON document with a bunch of key-value pairs specific to a particular plugin. It has two important fields: the name of the network, which defines what the virtual network will be called, and the type of the plugin. There are different types of plugins, such as the host plugin, the bridge plugin, IPAM plugins, and the port-mapper plugin. This configuration sits on each agent in the DC/OS cluster, along with the plugin binary, at a predefined location; a minimal example appears at the end of this section.

Now, when a framework wants to launch a task on a particular virtual network, it has to fill in the NetworkInfo Mesos protobuf, and what it has to fill in is the name of the virtual network; that's why the name is so important in the configuration. Let's say such a task is triggered and launched on a particular agent. The agent will create the network namespace along with the rest of the container isolation, and then hand that network namespace to the plugin. In this particular example it is the bridge plugin, because the type in the configuration is bridge, and the bridge plugin will then make sure the container is connected to the host network. That's how the CNI machinery works in DC/OS.
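To make the configuration concrete, here is a minimal sketch of a bridge-type CNI network configuration of the kind described above. The network name, bridge name, and subnet are illustrative assumptions, not values from the talk; the name and type fields and the overall shape follow the CNI 0.2.0 spec.

```json
{
  "cniVersion": "0.2.0",
  "name": "my-virtual-network",
  "type": "bridge",
  "bridge": "mesos-cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "172.26.0.0/16",
    "routes": [
      { "dst": "0.0.0.0/0" }
    ]
  }
}
```

A framework that wants its task on this network would then put my-virtual-network in the name field of the task's NetworkInfo protobuf, matching the name above.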
Now, one implementation built on top of CNI is IP-per-container. A container may always have an IP, but that IP may not be routable; for example, in a bridge setup, everything arrives at the host and is then port-mapped into the container network. When we say IP-per-container, we mean routable IP addresses: you can reach a container directly through its IP address. Let's see how that is achieved. It depends on a Mesos module; it uses the bridge CNI plugin, as I said; and for encapsulation it uses something called VXLAN, which is built into the Linux kernel.

The way you configure any overlay in DC/OS is through config.yaml. I'm sure those who have launched a DC/OS cluster have encountered config.yaml; it is a YAML file of key-value pairs that defines your cluster. The values you see on the screen are the default overlay network, which comes out of the box; you don't have to do anything. But you can change this configuration to change the subnet for the overlay or any other setting, and you can also add multiple overlays. (A sketch of this configuration appears at the end of this overlay discussion.)

One thing I haven't mentioned yet is how the overlay is connected to IP-per-container, and I also owe you the reasoning for why we need an overlay network at all; both will become clearer over the coming slides. At a high level, there is an overlay Mesos module running on the master as well as on all the agents. At boot time, the module running on each agent registers with the master. As part of this registration, the master takes the subnet configuration and breaks the subnet down into equal chunks; this is done statically at registration time, and the master hands one chunk to each individual agent. There is also a helper module running on each agent, called Navstar, which continuously polls the local agent state for overlay information. As soon as it gets the overlay configuration, it configures the routes in the kernel for that particular overlay, and it also informs the neighboring Navstar instances about the configuration, so every Navstar can build global connectivity for the overlay. Now, when Marathon wants to launch a task, it picks one of the agents, and the task gets an IP address that is routable.

This slide is the one that connects the overlay to IP-per-container. Imagine that a container has an IP address; its subnet will certainly be different from the host's, so you need some kind of encapsulation over the host network to be able to route the traffic. Say container one wants to talk to container two. It sends a packet, and the routing entries on agent one make sure the packet goes to VTEP one. VTEP one does VXLAN encapsulation and sends the packet to the destination, VTEP two. VTEP two decapsulates it, and that's how the inner packet, which is the packet actually destined for container two, reaches it. That's why we need the overlay, and that's how the overlay is connected with IP-per-container.
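To tie this back to the config.yaml I mentioned at the start of the overlay discussion, here is a hedged sketch of what the overlay section can look like. The name "dcos" and the subnet and prefix values are illustrative of the defaults of the time (the prefix of 26 comes up again in the Q&A at the end); treat the exact keys as a sketch rather than authoritative documentation.

```yaml
# config.yaml (excerpt): cluster-wide overlay definition.
# The master statically slices "subnet" into /"prefix" chunks and
# hands one chunk to each agent when it registers.
dcos_overlay_enable: true
dcos_overlay_network:
  overlays:
    - name: dcos          # virtual network name that tasks refer to
      subnet: 9.0.0.0/8   # address space for the whole cluster
      prefix: 26          # size of each per-agent slice
```

Adding a second overlay is just another entry under overlays, with its own name and subnet.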
Now, service discovery is done through Spartan and Mesos-DNS. Both are open-source projects, and both monitor the tasks being launched in containers and create the appropriate SRV and A records for service discovery. At a high level, both Mesos-DNS and Spartan run on the master and poll the master state, to see whether a new task has come in and new A or SRV records should be created, or whether some task has died and its records should be removed from DNS. Spartan also communicates this information to all the neighboring Spartan instances running on the agents; that's how each agent gets the entire set of DNS records for the whole cluster.

The benefit of this is that when a task running on a particular agent issues a DNS query, the query is intercepted and answered locally by Spartan. As long as the name can be resolved within the cluster, the DNS query doesn't even leave the agent, which gives you a lot of scalability. So Spartan is really a distributed DNS proxy that reduces latency. Besides running on every agent, it also does something clever called dual dispatch, and we will see how dual dispatch helps speed up DNS resolution. Finally, it supports configuring upstreams per domain, so you can have the .com TLD going to one upstream and the .org TLD going to another.

Coming to dual dispatch: dual dispatch is an optimization in Spartan. If you have dealt with DNS, you know that resolution usually works like this: the system picks one of the upstreams and sends the DNS query to it, then waits for the query to succeed or fail; only when a timeout happens does it pick another upstream and retry. That can add latency to DNS resolution. What Spartan does instead, whenever there is a query from a Mesos agent, is pick two upstreams and do a dual dispatch: it sends the query to both upstreams simultaneously. Whichever upstream responds first, Spartan forwards that response to the querying task; for the second response that comes in, it just records the metrics for that upstream, so that in the future it will tend not to pick an upstream that was slow last time.

To give you a picture of how it all fits together, let's say we have a cluster with a bunch of masters and agent nodes running, plus an upstream. If a task queries for a local DNS name, as I said earlier, resolution happens locally on that agent. If the query is for something external to the cluster, such as a .com record, Spartan sends it to the upstream. And everything under .mesos goes to Mesos-DNS. That's how service discovery happens in DC/OS.

Now coming to load balancing: load balancing is done through Minuteman and Marathon-LB. Minuteman is a layer-4 load balancer; it works on TCP, and it uses something that is already in the Linux kernel, IPVS, the IP Virtual Server. The benefit is that the entire data plane lives inside the kernel, and Minuteman handles only the control plane.

The way you use a VIP in DC/OS is through the app definition. Those who have interacted with Marathon know that you submit an app definition in order to launch your task, and in this particular example it is done through labels: you specify a VIP label along with the name of the VIP. These are named VIPs, but you can also use something called an IP VIP, where you give the IP address and port directly. If you give a name, that name is translated into the actual DNS name at the bottom of the screen: if you give web-server, it becomes web-server.marathon.l4lb.thisdcos.directory. (A sketch of such an app definition follows below.)
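Here is a minimal sketch of such an app definition, assuming the VIP_0 port-label convention DC/OS used for named VIPs; the app id, command, and resource numbers are made-up placeholders.

```json
{
  "id": "/web-server",
  "cmd": "/usr/local/bin/start-web-server.sh",
  "cpus": 0.1,
  "mem": 64,
  "instances": 3,
  "portDefinitions": [
    {
      "port": 0,
      "protocol": "tcp",
      "labels": { "VIP_0": "/web-server:5000" }
    }
  ]
}
```

With this label, clients inside the cluster reach the app at web-server.marathon.l4lb.thisdcos.directory:5000, no matter which agents and host ports the three instances actually land on.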
At a high level, let's say Marathon requests a task to be launched with the VIP label foo:5000, where foo is the VIP name. The master picks one of the agents, say agent one, to launch the task. The task is launched, but the actual port the task is running on is, say, 6789. Now, the Minuteman instance running locally on the agent where the task was launched is continuously polling for state, roughly every two seconds, and as soon as the task is launched, it picks up the mapping between VIP:port and the actual address and port of that task. It then gossips this information to the entire cluster through the neighboring Minuteman instances. All the Minutemen, whether they run on a master or an agent, independently pick an IP for the VIP. The reason an IP address is not gossiped is that an IP address is state: if the Minutemen communicated an IP address instead of the name, two Minutemen might conflict by picking the same IP address. So they only communicate that a task with the VIP name foo was launched on a particular agent, and each Minuteman picks its own IP address for it.

Each Minuteman then creates a local A record in Spartan, so every Spartan ends up with an A record mapping the VIP to an actual IP. It also programs the kernel with the appropriate front end and back end: the front end is the IP picked for this particular VIP, in this case 1.2.3.4:5000, and the back end is the actual server IP with the task's port.

Now let's say task two wants to connect to task one through the load-balanced VIP. It queries its local Spartan for the DNS name; Spartan returns the local IP address that was picked by Minuteman; and then task two connects. The connect is intercepted by the kernel locally on that agent, and because the IPVS entry has the back-end information, the connection ultimately reaches task one running on agent one. That's how task two is able to connect to task one.

Now coming to Marathon-LB. Marathon-LB is a layer-7 load balancer; where Minuteman was just layer 4, Marathon-LB is layer 7, and it is a wrapper around HAProxy. The way it works, at a high level, involves a concept called the public agent. Public agents are agents that have an IP address, or an interface, exposed outside the cluster; that's why they are public. Marathon-LB runs on a public agent. It continuously listens on the Marathon event bus for new tasks, and as soon as a new task is launched, it updates the HAProxy configuration internally. Then, when an external client connects to this HAProxy, it load-balances across the tasks. That's how you expose your internally running services outside the cluster. It works pretty much the same way as a Minuteman VIP: there are labels in the app definition. There are tons of labels if you've looked at the Marathon-LB wrapper, but the two in this example are the most important: one marks the service as externally exposed, and the other defines what DNS name it should have. (A sketch follows below.)
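And here is a hedged sketch of the Marathon-LB side, assuming the commonly used HAPROXY_GROUP and HAPROXY_0_VHOST labels; the app id, image, and hostname are placeholders, and the exact container schema varies with the Marathon version.

```json
{
  "id": "/external-web",
  "instances": 2,
  "cpus": 0.25,
  "mem": 128,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "nginx",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 80, "hostPort": 0, "protocol": "tcp" }
      ]
    }
  },
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "web.example.com"
  }
}
```

HAPROXY_GROUP tells the Marathon-LB instance serving the "external" group to pick this app up from the event bus, and HAPROXY_0_VHOST gives HAProxy the virtual host name to route external traffic on.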
Okay, that brings us to the things we are currently working on. The very first thing you will see in the future is IPv6 support in the DC/OS networking stack. Then there are the CNI spec versions: right now we support 0.2.0, and going forward we want to support 0.3.0. One of the main features in 0.3.0 is service chaining. Those of you who have some experience with CNI will know that many times you need different services, and by "service" here I mean things like a load-balancing service or a DNS service; IP connectivity is also a service, there is an IPAM service, there is a port-mapper service. Today, in spec 0.2.0, you really cannot mix and match these services in a pluggable way. But 0.3.0 has service chaining, which lets you pick any of these different services and create a chain of them, so different virtual networks can dynamically have different services for the same functionality. That's pretty powerful. (A sketch of the chained format appears below.)
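To show what chaining looks like, here is a sketch in the upstream CNI 0.3.0 chained-configuration (.conflist) format; this is the spec's format with reference plugins (bridge plus the port mapper), not a DC/OS feature that existed at the time of the talk.

```json
{
  "cniVersion": "0.3.0",
  "name": "chained-network",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "ipam": {
        "type": "host-local",
        "subnet": "172.26.0.0/16"
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
```

The runtime invokes each plugin in the plugins list in order, so connectivity, IPAM, and port mapping become services you can mix and match per virtual network.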
Then, today in DC/OS, if you want to use a certain CNI plugin, the steps are very manual: you have to deploy the plugin binary as well as the configuration manually on each agent. We want to take that away and make it more streamlined, so we are working on something called the CNI configuration service, which will have the effect that you click a few buttons in the UI and your CNI plugin is ready to be used in a network.

Then there is a demand for multi-tenant networking: certain users or certain organizations should not be allowed to launch containers on certain networks. So we are working on authorization and authentication for networks, and also on security policy. When we talk about security policy, it operates at two layers. One is per virtual network: whether a virtual network should be allowed to communicate with another virtual network, and what kind of isolation we need among virtual networks. The other is within a virtual network, where there are ACLs: which ports and IPs should be allowed to connect to which ports and servers. This gives us the flexibility to say which services are allowed to connect to which other services. So we are working on that. And finally, we come back to the big picture that we started our journey from. That concludes my talk for today. Any questions?

Mostly, I have seen people using Calico. Pardon me? Weave? No, mostly I've heard about Calico. Actually, when you say "solution," you also need to think about what you need from the solution, what kind of functionality you are expecting. If that covers it, then it's fine, but there are virtual networks that provide more functionality.

Sorry, could you repeat that question? Right. So Marathon-LB doesn't run on every agent; it runs only on the public agents that you have. Beyond that it depends on your requirements. Each Marathon-LB instance is an HAProxy instance, so in certain cases, just to give an example, you may want one HAProxy for your HTTP traffic and another for your HTTPS traffic; in that case, you may want to run two instances of Marathon-LB on two public agents. It really depends on what functionality you need from HAProxy, and that defines how many instances of Marathon-LB you would like to run.

Minuteman is heavily dependent on the Mesos state. So if you are asking whether it can run without DC/OS: yes, it can. It is independent of DC/OS, but it is dependent on Mesos for its state.

An estimate? Probably by the end of this year.

Any other questions? Yeah. Currently, no. All the records that are created today have a default time-to-live of five seconds. But how fast the system reacts to a change, that's not TTL; that depends on the polling period. The polling period for DNS is 30 seconds, so if some task came in just after we last polled, you can expect its entry to be there within 30 seconds. That is not TTL; TTL is how long a cache sitting somewhere should keep a DNS record before refreshing it.

Yes, that's a good point. I don't know your name, but he made a good point: I was saying that Spartan is a DNS proxy, but Spartan is a DNS proxy as well as a resolver. There is a resolver sitting inside Spartan that does DNS resolution, and that's how the local DNS queries are resolved by Spartan.

Any other question? Anything that was not clear? Yes. Yeah, that's a good question. The question is: I said that for an overlay, the master statically divides the subnet equally among the agents; so how do we support adding new agents? In DC/OS you can add new agents, and the subnet will already have been divided, so where would a new agent get its subnet from? The answer is in config.yaml. Part of config.yaml is also the prefix of each subnet that an agent is supposed to get, and that defines how many slices will be cut from the global subnet. Let me jump back to that slide. If you look, there is a third parameter called prefix, after name and subnet. The subnet is the whole subnet for the entire DC/OS cluster, and the prefix parameter decides the slices cut out of it: the master takes this prefix of 26 and divides the entire subnet into /26 networks. The operator has to make sure the subnet is big enough to be handed out to all the agents; if the number of agents increases, these entries have to be modified.

Anything else? Yes? Right. Yeah, good point, and a good eye as well. I mentioned that the CNI isolator is the one invoking a CNI plugin, and that is only possible with UCR. But Docker, on the other hand, can also have IP-per-container, and we support that in DC/OS. So how does it work when Docker doesn't support CNI? The way we do it is custom for this particular case, because it is supported out of the box in DC/OS: we take the path of Docker user-defined networking. We pre-create a user-defined network using the CNI plugin's mechanism and launch the Docker container on top of that. Docker is able to launch containers on a network if you pre-create it and add it to Docker's network list, and we do that in the Mesos module: when somebody configures an overlay, we go ahead and create the appropriate Docker networks and attach the Docker containers to them. That's how it works here.

Anything else? Okay. Well, thank you.