Okay, let's start. I hope you had a good lunch and are ready to dive a little bit into engineering. I'm happy to introduce my colleague Deepak, who is the maintainer of the open source networking components that we built on top of Mesos. Maybe you've heard of Spartan, Minuteman, and Navstar. He's going to talk about the DC/OS networking stack. Deepak, thank you.

Thank you. Welcome, everybody. To begin with, Alex has already introduced who I am: I'm Deepak, a technical lead at Mesosphere maintaining the open source projects that Alex just mentioned, which include the service discovery and load balancing components. Today's talk focuses on these components and how they work together to build container networking in DC/OS.

Microservices have really emerged as the modern-day architecture for applications. They are flexible, because you can divide your monolithic application into simpler components and design, develop, and deploy these components independently. They are also scalable, because you can independently decide which components you want to run in multiple instances. But this scalability and flexibility come at the cost of deployment complexity. Imagine you have these components and their dependencies: if two components happen to run on the same host, their dependencies might conflict with each other.

The answer is to run everything in containers. A container is this nice concept where you pack a component together with its dependencies and run them in isolation, such that even if two containers happen to run on the same host, the dependencies of the components won't conflict with each other.

But I'm sure those who have played enough with containers understand that containers by themselves are not enough. First, these containers are transient; they are mortal. They die, or they could die, so you need a system that continuously monitors them and can relaunch them in your cluster. Each container also consumes a certain amount of system resources, like CPU, memory, and disk, so you need some sort of resource management for the containers in your cluster. And finally, the services running within these containers need to be able to discover and talk to each other, so you need some mechanism for service management.

This is where DC/OS comes in. DC/OS is a container orchestration platform with Mesos at its heart. Along with its frameworks, it supports container scheduling and resource management, and it also does service discovery and load balancing. We will see in today's talk how this is done in DC/OS. But before we look into the specific DC/OS stack, I want you to understand the complexities and challenges involved in providing container networking.
So let's say you have this DC/OS container orchestration platform running a bunch of containers. The very first challenge is to provide IP connectivity among these containers. The reason this is a challenge is that the containers have different modes of operation: they could be running in host mode, just like any application on the host, or they could be sitting deep inside a VM, which in turn runs on a host. It is a multi-layer deployment.

Once you resolve IP connectivity, the next challenge is to provide a service discovery mechanism. As I was saying earlier, the services running in these containers need to be able to talk to each other, and since the containers are transient, they can continuously be dying and getting rescheduled on different hosts. Your service discovery mechanism has to be efficient enough to update the service records in a timely manner. Finally, you want multiple instances of these containers running behind a load balancer, and load balancing shares pretty much the same challenge as service discovery: it needs to reflect the changes happening in the cluster in a timely manner.

This brings us to today's talk. I'll give a detailed overview of CNI, the Container Network Interface; then we'll see how service discovery is done in the DC/OS networking stack; and finally the load balancing. But before we dive deep into each of these components, let's see what the overall picture looks like and how they fit together to complete container networking in DC/OS.

Say you have a master with a bunch of agent nodes. You can either use the Docker runtime to launch Docker containers, or you can use the Universal Container Runtime (UCR), the Mesos runtime, to launch both Mesos containers and Docker containers. UCR has native support for CNI, the Container Network Interface, which we'll look at in the follow-up slides. CNI is a specification that makes it really easy for any third-party network provider to write a plugin against it, and then it automatically works with Mesos. That's why you see so many third-party network providers offering IP connectivity: they take care of connecting the containers and providing a flat network. Docker, on the other hand, uses something called the Container Network Model (CNM), which is similar to CNI, but it's not a standard; it's their own home-grown thing. It works similarly, though.

The service discovery part is done by two components, Spartan and Mesos-DNS. Spartan runs on all the nodes, including the masters, in a distributed fashion, and we'll see what benefit being distributed gives us; the instances gossip around to share cluster information. Similarly, load balancing is provided by a component called Minuteman. Like Spartan, it is fully distributed and runs on each node, and the instances gossip to build a global view of the cluster. Just keep this picture in mind while we discuss each of these components separately, so you have context for how they fit into the entire picture.

Starting with the Container Network Interface: this is something proposed by CoreOS, and it has now been adopted by the CNCF.
As I was saying earlier, UCR has native support for CNI. The way it works is that there is an isolator in Mesos, the network/cni isolator, which is responsible for creating the network namespace for a particular container. It then invokes a CNI plugin, which sits on each agent at a predefined location, and hands the container's network namespace to that plugin. The plugin does the actual work of connecting the network namespace of the container to the host, and that is how connectivity is provided.

To give you a specific example: each virtual network in DC/OS comes with a configuration file, and this configuration file defines two important fields. The first is the name of the virtual network. The second is the type of plugin. There are different plugins depending on the functionality they provide: you have a bridge plugin, you have IPAM plugins which are responsible for IP addresses, you have IPVLAN and MACVLAN plugins, and many more. The type defines which plugin will be used for this configuration, and the name identifies the virtual network; we will see why this name is so important. This configuration sits on each agent at a predefined location.

Now let's say a task is being launched on this agent. The way the task gets assigned to a particular virtual network is through the name: the name field in the task, say mesos-net, matches the name in the CNI configuration. When the task is launched on the agent, the CNI isolator running in the agent creates the network namespace for the container and hands this namespace over to the bridge plugin; it is the bridge plugin because the type in the configuration is bridge in this particular example. The bridge plugin takes the network namespace and connects the container to the host network, or applies whatever logic it has built in. This is how the Container Network Interface works in general; a minimal sketch of such a configuration follows.
One specific implementation of CNI in DC/OS is the overlay network. Why do we need an overlay network? There is a requirement that sometimes you want to give an IP to a container. Usually the IP addresses you give to a container are non-routable: if you launch a container in bridge mode, that IP is not routable on the host network. But you can create something called an overlay on top of the host. The reason it is called an overlay is that the host network is considered the underlay, and you are doing an encapsulation on top of it.

Say you have two agents running two containers, and each container has its own IP address in a subnet different from the host's. If container one wants to talk to container two, container one forms a packet with container two's IP address as the destination and sends it to a VTEP interface running on the host. The VTEP encapsulates the packet and sends it to the neighboring VTEP, which decapsulates the packet and delivers it to container two. That is how IP-per-container is achieved through an overlay. The reason I said this is a particular CNI configuration is that it uses the bridge CNI plugin for connectivity; if you look at the previous picture, there is a bridge involved, and that bridge comes from the bridge CNI plugin. The encapsulation uses VXLAN, an off-the-shelf encapsulation mechanism in the Linux kernel.

The way you use an overlay network is through config.yaml. I'm sure those who have played enough with DC/OS have encountered config.yaml: it is a YAML file with the initial configuration of your cluster, and in it you can define the overlays for your cluster. The configuration you see on the screen is the default configuration that comes out of the box, but you could change it, either the subnet, or by adding new overlays. Each overlay is defined by a particular subnet; in this case it is the 9.0.0.0/8 subnet, but you could have your own overlay with a different subnet. A sketch of the relevant section of config.yaml follows.
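For reference, this is roughly what the overlay section of a config.yaml might look like. The key names follow the DC/OS installer configuration of this era, but check the docs for your version; the VTEP values and the per-agent prefix here are illustrative assumptions, while the 9.0.0.0/8 subnet mirrors the default mentioned above.

```yaml
# Sketch of the overlay section of a DC/OS config.yaml (illustrative values).
dcos_overlay_enable: true
dcos_overlay_network:
  # Addresses and MAC prefix used by the VTEP interfaces on each host (assumed).
  vtep_subnet: 44.128.0.0/20
  vtep_mac_oui: 70:B3:D5:00:00:00
  overlays:
    # The default overlay: the master slices this subnet into per-agent chunks.
    - name: dcos
      subnet: 9.0.0.0/8
      prefix: 24   # size of the chunk each agent receives (assumed)
```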
To give you a high-level view of how this overlay is set up in DC/OS: there is an overlay module running on each agent as well as on the master. When the agent's overlay module comes up for the first time, it registers with the master. At registration time, the master takes the whole overlay subnet, slices it into equal chunks, and hands one chunk to the agent; that is how each agent gets its address range, so the partitioning of the overlay subnet is static. Then there is a utility called Navstar running on each agent that polls the local overlay module. When the local module gets its subnet from the master, Navstar picks up that subnet and configures the appropriate routes in the host kernel. It also gossips this information to the Navstar instances running on the other agents, and that is how they build a complete mesh: each agent knows which subnets the neighboring agents hold and how to reach them. So when a task is launched by a framework, say Marathon, the task gets a routable IP address, because the routes have already been configured by Navstar.

That was IP connectivity; now on to service discovery. Service discovery in DC/OS is done through two components: Spartan and Mesos-DNS. For each task or service launched on a DC/OS cluster, a certain set of DNS records is created: A records, created by both Spartan and Mesos-DNS, as well as SRV records. To give you a high-level view of how they interact with the system: both Mesos-DNS and Spartan, running on the master, poll state.json, the state exposed by Mesos. Spartan takes this information, creates the necessary A and SRV records, and gossips them to the rest of the cluster. As I said earlier, a Spartan instance runs on each agent, and they gossip to build a complete view of the cluster. When a task running on a particular agent queries for a DNS record, that query is intercepted locally by the local Spartan, which responds to the container or service directly. So if the name is internal to the cluster, the DNS query literally never leaves the agent; that is one of the benefits of having Spartan distributed and running on every node. Spartan, then, is a DNS proxy that intercepts all the DNS queries coming from the services running on that particular host.

Along with being distributed, Spartan has another feature called dual dispatch. Usually, the way DNS works is that if you have a couple of name servers, the query goes to the first name server, and if for some reason it is not responding, the client waits for the timeout to happen before reaching the second name server. What Spartan does is send the query to both name servers simultaneously, so we don't waste time waiting for one name server to respond; whichever responds first, Spartan takes that response and sends it to the client. That gives a nice speed-up in DNS resolution. Besides that, Spartan can be made authoritative per domain: you can configure a domain and an upstream for that domain. For example, if you have an upstream that can only handle .com as the TLD, you can configure multiple upstreams for different TLDs.

The way dual dispatch works is: when there is a query from any agent, Spartan sends it to the upstream name servers simultaneously, takes whichever response comes first, sends it to the agent, and also stores a metric for the future. It remembers which name server responded first, so on the next query it can favor that name server over the slower one. If a task running on an agent queries a name local to the cluster, which is most of the time, the resolution happens locally on that agent. If it is something external, say a .com name, it goes to the upstream configured for Spartan; and if it is a .mesos name, it goes to Mesos-DNS running on the master.

That was service discovery through DNS; now to the load balancing part. Load balancing is done at two different layers by two different components. At layer 4, which is the east-west load balancing within the cluster, it is done by Minuteman. Layer-7 load balancing is done through a component called Marathon-LB, which is a wrapper around HAProxy. We'll see both of them today.

Minuteman uses the IPVS load balancer, which lives in the Linux kernel: it programs IPVS entries into the kernel. So the control plane is Minuteman, but the data plane is entirely in the kernel; the full load balancing happens in the kernel itself. The algorithm we use today is weighted least connection, but the weight is one for all connections, so it is effectively plain least connection. The sketch after this paragraph shows roughly what the resulting kernel state looks like.
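Minuteman programs these entries through the kernel's IPVS interface directly, but you can picture the resulting state as if it had been created with the standard ipvsadm tool. This is only a sketch of the equivalent commands, with made-up addresses: a VIP 1.2.3.4:5000 in front of a backend task at 10.0.0.42:6789.

```sh
# Create a TCP virtual service for the VIP, scheduled with
# weighted least-connection (wlc).
ipvsadm -A -t 1.2.3.4:5000 -s wlc

# Add a real server (the task's actual address and port) behind the VIP.
# -m = NAT/masquerading forwarding; -w 1 = weight one for every backend,
# which makes wlc behave like plain least-connection, as described above.
ipvsadm -a -t 1.2.3.4:5000 -r 10.0.0.42:6789 -m -w 1

# Inspect the programmed kernel state.
ipvsadm -L -n
```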
The way you use a VIP in DC/OS is through the app definition. When you launch an app through Marathon, you provide a JSON file that contains the configuration for the app, and in it you can specify a label with the VIP name. The fully qualified DNS name that gets generated is the one at the bottom of the screen: webserver.marathon.l4lb.thisdcos.directory. The reason the name is so long is that we wanted multiple frameworks to be able to have apps with the same name; for example, an app called webserver could exist in some other framework too. So the DNS name expands to the service name, then the framework name, then l4lb.thisdcos.directory.

At a high level, say a task is launched by Marathon. Marathon conveys that information to the master, and the master selects one of the agents based on resources; let's say it selects agent one. The label you see, 1.2.3.4:5000, is the VIP. Note that the VIP port is different from the actual port on which the task is running; in this particular case, let's say the task port is 6789. Minuteman, running on agent one, locally polls the Mesos state on that agent, and that is how it learns there is a task launched with a VIP. It gossips this information to all the other Minuteman instances so that they can also create a record and program their IPVS entries. Each Minuteman, when it learns there is a task launched with a VIP that needs load balancing, assigns that VIP address locally on its agent and programs the kernel with the information: the VIP is 1.2.3.4:5000, and the backend is task X with its actual port.

Now say task two wants to connect to task one through the VIP. It first queries DNS for the VIP name. The DNS query is intercepted by Spartan, which responds to the agent with the actual VIP address, so the client connects to 1.2.3.4:5000. When it tries to connect to that VIP, the IPVS entry that Minuteman created in the kernel intercepts the connection request and forwards it to the actual task at X:6789. That is how layer-4 load balancing works in DC/OS. A sketch of the corresponding app definition follows.
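Here is a minimal sketch of such an app definition under assumed values: the id, command, and resource numbers are placeholders, and the exact schema depends on your Marathon version. The VIP_0 label on the port definition is what Minuteman picks up.

```json
{
  "id": "/webserver",
  "cmd": "/usr/local/bin/start-webserver.sh",
  "cpus": 0.1,
  "mem": 64,
  "instances": 2,
  "portDefinitions": [
    {
      "port": 0,
      "protocol": "tcp",
      "labels": { "VIP_0": "/webserver:5000" }
    }
  ]
}
```

With a named VIP like /webserver:5000, the layer-4 name webserver.marathon.l4lb.thisdcos.directory:5000 is generated; alternatively, a literal address-style VIP such as 1.2.3.4:5000 can be given as the label value, which is the form used in the example above.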
Now coming to layer-7 load balancing, which is done by Marathon-LB. As I said earlier, it is a wrapper around HAProxy. It hooks itself into the Marathon event bus, so as soon as an application is launched by Marathon, it gets the event and programs HAProxy accordingly. Marathon-LB is already watching for the tasks launched by Marathon and configures HAProxy for them, so when a client actually connects to Marathon-LB, it load balances across those tasks. The way you instruct Marathon-LB for a particular task is again through the app definition in Marathon: you specify labels, and there is a bunch of labels for the different configurations; pretty much everything HAProxy can be configured with has a corresponding label. In this case I'm showing two labels: one is the group label set to external, which says the configuration you are creating is for external clients, and the second is the vhost, which is the DNS name at which the external clients will connect. A sketch follows below.
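Here is a minimal sketch of those two labels in a Marathon app definition. HAPROXY_GROUP and HAPROXY_0_VHOST are the Marathon-LB label names for the group and the vhost; the app id, command, port, and host name are placeholders.

```json
{
  "id": "/webserver",
  "cmd": "/usr/local/bin/start-webserver.sh",
  "cpus": 0.1,
  "mem": 64,
  "portDefinitions": [{ "port": 10000, "protocol": "tcp" }],
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "webserver.example.com"
  }
}
```

A Marathon-LB instance started for the group "external" picks this app up from the event bus and generates an HAProxy frontend for webserver.example.com that balances across the app's tasks.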
Finally, here is what we are working on currently and should support in the future. We are working on IPv6 support: so far our DC/OS network stack is IPv4 only, but going forward it will support both IPv4 and IPv6. We are also working on CNI spec version 0.3, which introduces a nice concept called plugin chaining. Imagine you want to use services from different CNI providers: today you cannot really mix and match CNI providers in a single virtual network, but with chaining you will be able to do that. And then there is multi-tenancy support: today any operator can use any virtual network to launch containers. We want to make that more secure, so that an operator can define rules under which only certain users or operators can launch containers on certain networks and are forbidden from launching containers on others. This is what we are working on, building on the stack we started with. That concludes my talk; I'll open the floor for questions and answers.

Thanks, Deepak. For questions we have a microphone here. Do we have questions? One question there.

Hi, thank you. I have a question specifically about the DC/OS Spartan component. You were saying you can configure it to use the dual dispatch mode. In that case, I'm interested in whether you are limited in the number of upstream DNS resolvers you can configure, because as far as I know, dual dispatch goes to two separate name servers, right? So what happens if I accidentally configure three upstream resolvers and it selects just the two that don't resolve the particular record I'm looking for? Am I limited to just two in this case?

That's a good question. Say you have multiple upstreams configured. Spartan will randomly select two; each time it does the selection, it does it randomly. If it selects two and unfortunately both of them are not responding, it will remember that, and the next time it selects, it will give lower weight to those two and select the third one. Over time it learns which upstream always responds and will always include that upstream in the dual dispatch. But yes, initially, if your upstreams are not responding, it will take at least one or two tries to learn that they are not responding.

Okay, thank you. Other questions?

I was wondering if you are considering switching from IPVS to eBPF in Minuteman. Are you considering this solution?

From IPVS to eBPF? Yes. The only challenge there is that eBPF requires a certain minimum kernel version, I think Linux 4.4 or above, and many of our customers are running really old kernels, so there we do not have support for eBPF. But we do want to provide this support for the kernels that have it, so in the future customers will be free to choose whether they want to use IPVS or eBPF, depending on what their kernel supports.

Okay, thank you. Other questions?

When it comes to Marathon-LB: when you have a syntax error in the labels, HAProxy won't restart, and so the cluster isn't reachable anymore. Are there any plans to work on this?

Sorry, I didn't catch the question.

Essentially, when you deploy a container with a label that has a syntactical error, HAProxy won't restart, right? And then your whole cluster isn't reachable anymore. Are there any plans to work around this?

I guess not; you need to fix that problem.

Right, but one developer sets a bad label and the whole cluster goes down.

So you're saying: is there a way the validation can happen before everything goes down? Something like a dry-run restart?

Yeah. I mean, that's a good point. Today we don't do any validation, at least at the app definition level. We could think about adding that. Also, Marathon-LB is an open-source project, so if you have something in mind, you could contribute it.

I think there's a pull request for that issue pending since a few months ago.

Okay. I think we could also write a Grammarly plugin; every second time I watch a YouTube video it recommends I use Grammarly. Maybe it would help in this case. Other questions?

One in front. I have an almost purely theoretical question. What if I do not want to use IP-per-container? Can I use the network isolator without that? Let's say I'm in high-frequency trading, where the distance between the router and the rack actually matters, and I want to bind the container to a particular network card. How does CNI help me there? And also, what is the performance hit when I use, for example, the proposed bridge overlay network?

Right, okay. There are a couple of questions there. The first question is whether a CNI network will help you connect to a particular network card. The answer is yes. The way CNI works, the logic of how the networking has to be done is in the plugin itself. As long as the plugin has the logic for how it wants to lay out the network, it will work with Mesos and DC/OS, because, as I was saying, the network isolator simply creates the network namespace and hands it over to the CNI plugin. It's then up to the plugin how it wants to connect, so it can pick a particular network card and connect your container to that.

Now coming to the second question: what is the performance hit of using bridge mode or overlay mode?
From a technical point of view, we need to understand where the performance hit might come from. In the networking world, the performance hit always comes when you have to copy the packet. The reason switches and routers work blazingly fast is that they always work on headers; they never pull the entire packet into memory. But if you have to pull the entire packet into memory, because you are doing some sort of NAT or some sort of encapsulation, then you will take a performance hit, and usually that hit is something like 40 percent, mainly because you are taking the entire packet and encapsulating it into another packet. That is the overlay case. With a bridge, the Linux bridge implementation has improved quite a bit, but it still takes a hit, again because of the copying of the packet. Now, there are implementations, like the fast bridge in OVN, that have removed some of this performance hit, so you could use those for better performance. But as long as you are using the native Linux bridge implementation or the overlay, you will have a performance hit.

Does that answer all the questions? Do we have other questions? All right then, I think we can thank Deepak with an applause. Thank you.