What we're going to talk about today is IP anycast and how to implement it with DC/OS or Mesos. Everything I'm showing you here should work equally well with open source Mesos or with DC/OS. My name is Bill Green, I'm a site reliability engineer at New Relic, and what I'm showing you today is something we're experimenting with in our data center. I should say from the outset that this is not a silver bullet: it has a limited number of use cases, but for those cases it works pretty well. I'm hoping that in covering these topics we'll go over some things that are useful anyway, like how the networking works and what plugins are, so it should be broadly useful.

So the first thing we should talk about is: what is anycast? It's one of those terms you hear a lot, and maybe it's not real clear what it is. From a layer-2 perspective, on an Ethernet LAN, you have broadcast traffic, unicast traffic, and multicast traffic, and the difference between them is how the frames are addressed to stations on the LAN: broadcast goes to everybody, unicast goes to one station, and multicast goes to a group, where the members of that group can elect themselves into it.

IP anycast works at layer 3, and it's not so much a feature as an effect; we'll get into that. It's where the same IP address appears in multiple locations in the network, and you let the routing protocol decide which particular instance of that address receives the traffic. The way this works is that routers keep a routing table with destinations in it, and normally a router picks one route, by the closest path, the lowest metric, or some administrative policy, and puts that in the table as the active route. So even though you may have multiple routes to a destination, the router normally only installs one. With anycast, it will actually make use of multiple routes.

This sort of thing has been in use for a while. If you've ever used Google's public resolver, 8.8.8.8, that's an anycast address: when you hit that IP you get routed to the closest 8.8.8.8, and it's not necessarily the same one every time.

Like I said, anycast is an effect, not a feature, so if you go thumbing through your router manual looking for how to turn on anycast, you're not going to find it. The feature is called ECMP, or Equal-Cost Multi-Path. Usually it's just a little knob in the config where you tell the routing protocol, be it BGP or OSPF or whatever routing protocol you're using, that if it can see multiple paths to a particular destination, it should keep all of them in the routing table rather than just picking the best one. It will do this if the routes are of equal cost, and there are some other details in there: if you're running BGP, for instance, it typically wants the paths to have the same destination AS, but there are knobs to override that.
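To make that distinction concrete, here's a toy model in Go. It isn't any router's actual implementation, just the selection logic: classic behavior installs a single best route, while ECMP keeps every route tied for the lowest metric.

```go
package main

import "fmt"

type route struct {
	nextHop string
	metric  int
}

// activeRoutes returns what gets installed in the routing table: the
// single best route normally, or all equal-cost routes when ecmp is on.
func activeRoutes(candidates []route, ecmp bool) []route {
	best := candidates[0].metric
	for _, r := range candidates {
		if r.metric < best {
			best = r.metric
		}
	}
	var active []route
	for _, r := range candidates {
		if r.metric == best {
			active = append(active, r)
			if !ecmp {
				break // classic behavior: one active route, period
			}
		}
	}
	return active
}

func main() {
	dest := []route{{"hop-a", 10}, {"hop-b", 10}, {"hop-c", 50}}
	fmt.Println("without ECMP:", activeRoutes(dest, false))
	fmt.Println("with ECMP:   ", activeRoutes(dest, true))
}
```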
This little knob for Equal-Cost Multi-Path applies to BGP and to OSPF, both version 2 and version 3, and you can use it for MPLS if you're advanced. It's usually as simple as going into the configuration and saying how many routes you want to use; the router implementation will have a limit on the number of equal-cost routes it supports, usually on the order of 16 or 32. So under your BGP config you'd say something like "use four routes," and that enables the feature.

A contrived example, and I'm sorry for being US-centric here, but being an American my geography is poor, so I'm going to rely on what I know. Let's pretend we have San Francisco, Chicago, New York, and Dallas, and you can imagine a ring around the US. If you're connected to San Francisco and you want to send something to New York, you can go the northern route, San Francisco to Chicago to New York, or the southern route, San Francisco to Dallas to New York. So in this contrived example the SFO router has two equal paths to that destination. If you turn on equal-cost multi-path in San Francisco, it will put both of those routes in its routing table, and when it gets a packet from A to B it will basically load-balance, or load-share, across both paths.

Now at this point the astute reader will probably freak out and say, oh my god, what if packets get out of order? There are implications to doing this. One of them is that ECMP will take care of the flows for you, generally speaking. The router hashes the flows using some hashing algorithm; typically it takes the source and destination IPs and ports and hashes those together into a group of buckets, so that all of the packets in the same flow go the same route. This avoids packets from the same flow being reordered because they took different paths. There are usually knobs in the router to affect how this hashing works: you can tell it which fields to use, source and destination ports for instance, and sometimes you can go as far down as the Ethernet address if you wanted to. On modern routers the hashing algorithm is usually a consistent algorithm, so that if one of the paths becomes unavailable you don't have to rebalance all of the hash entries; you only lose the entries for the link that you lost. This matters if you're balancing long-lived flows: you don't want to disrupt everything when you lose one path, and with consistent hashing you usually don't have to.
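As a sketch of the idea, not any vendor's actual algorithm, hashing a flow's 5-tuple to pick one of N equal-cost paths might look like this:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type flow struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
	proto            uint8
}

// pickPath hashes the 5-tuple into one of n equal-cost paths, so every
// packet of a given flow takes the same route and never gets reordered
// across paths.
func pickPath(f flow, n int) int {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%d", f.srcIP, f.dstIP, f.srcPort, f.dstPort, f.proto)
	return int(h.Sum32()) % n
}

func main() {
	f := flow{"10.0.0.1", "10.9.9.9", 49152, 443, 6}
	// Same flow in, same path out, every time.
	fmt.Println(pickPath(f, 4), pickPath(f, 4))
}
```

A real consistent-hashing implementation goes further: removing one path only remaps the flows that were on it, whereas a plain modulo like this would reshuffle everything when n changes.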
So now we're going to talk a little bit about routing in a data center, so we know how to actually implement this. The usual design is the Cisco model that was popular for a long time and is still in wide use: core, distribution, and access, where you aggregate a lot of ports at the top of rack and keep aggregating traffic as you go north in the data center. This is not really a design that will scale. It does have some benefits; it's optimized to lower the per-port cost at your top of rack, and that's really what it's designed to do. The drawbacks are pretty much everything else: it's really hard to scale, it's expensive, and there's usually a lot of layer 2 involved, so you get these really large broadcast domains that are sometimes hard to troubleshoot.

The alternative is a so-called spine-leaf topology. This is the more modern way to do it, and it's becoming more popular in the data center. In this design you have your top-of-rack switches and a spine layer that acts as your core, and everything talks BGP, so each rack becomes an autonomous system. The racks are like their own little ISPs, their own little networks, and all of these little islands exchange routing information with the spine as if the spine were the internet provider of those racks. At first glance it seems kind of crazy to do something like this, but the advantage is that BGP is a very robust protocol: it's been in use forever, it's very mature, it's well understood, and it's actually very simple to configure. It's generally as simple as naming a route and saying "I have this route, export it to everyone else," and it will go do that.

Topologically, you usually have a grouping of spines that follows a power of two, so two, four, eight, sixteen spines, and each top of rack is connected to each one of the spines, with equal-cost multipath routing used throughout. If you imagine these are 10-gig links, you have an aggregate bandwidth of 40 gigs out of the top of rack, load-balanced among the different paths, and if you lose one of those links the traffic just shifts over to the remaining three. In the failure case you just degrade your bandwidth a little.

You also have a numbers game where the maximum number of racks you can support is set by the port density of your spine. Spines are typically on the order of 96 ports, so a spine layer like that supports roughly 100 racks. And you oversubscribe the bandwidth in your racks, by something on the order of three to one, where you might have 40 gigs aggregate out of your top of rack against 30 servers in the rack connected at one gig each, or whatever your mix is. There's oversubscription here, so it requires you to be judicious about how you provision these things.

So, the spine-leaf topology: it's layer 3 everywhere. You could do OSPF, but the more common choice I've seen is BGP, usually eBGP, so it's external: everything is an autonomous system unto itself, and it peers with everybody. The benefit is massive scale-out. Every rack becomes one hop away, which tends to equalize your latency: each hop is basically one bounce off the spine layer, so it's very low latency, and you don't have the crazy paths of a core/distribution/access model where you go all the way up the stack and back down to reach something. And like I said, with BGP it's simple, it's robust, it's mature. So what does this look like? You have your top of racks and your AS's, and you might have something like this, where each rack is an AS.
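To make the numbers game concrete, here's a back-of-the-envelope sketch. The values are illustrative, chosen so the oversubscription comes out near the three-to-one mentioned above; they aren't from any real design.

```go
package main

import "fmt"

func main() {
	const (
		spines         = 4    // spine switches; powers of two are typical
		spinePorts     = 96   // ports per spine; each rack uses one per spine
		uplinkGbps     = 10.0 // speed of each ToR-to-spine link
		serversPerRack = 12   // illustrative
		serverGbps     = 10.0 // illustrative
	)

	// Each rack needs one port on every spine, so spine port density
	// caps the number of racks.
	fmt.Printf("max racks: %d\n", spinePorts)

	uplink := spines * uplinkGbps // aggregate bandwidth out of each ToR
	fmt.Printf("uplink per rack: %.0f Gb/s\n", uplink)

	// Losing one ToR-to-spine link only degrades the uplink.
	fmt.Printf("after one link failure: %.0f Gb/s\n", (spines-1)*uplinkGbps)

	// Oversubscription: server-facing bandwidth vs. uplink bandwidth.
	fmt.Printf("oversubscription: %.1f:1\n", serversPerRack*serverGbps/uplink)
}
```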
You can actually bring this all the way down to the host if you want. You can run a BGP daemon on the host, put the host in its own AS, and then for everything that runs on the host you simply announce a host route for that thing. Your thing can be a container, a service, whatever, and as your services come and go, the routes are announced and withdrawn. This saves you from having to run a load balancer to keep track of where everything is, or from building complicated service-discovery schemes just to understand where your stuff lives.

But what does this look like as far as Docker goes? Because this is where the hard part comes in. A typical Docker deploy, your standard Docker install, runs in a layer-2 bridge. Inside the Linux kernel you get a bridge device, with docker0 on it, and you allocate interfaces in pairs. When you spin up a container, you allocate a network namespace to it, so the container gets its own copy of the network stack; then you build a virtual Ethernet (veth) pair, give one end to the container and one end to the kernel, and that's how you talk across the namespace boundary. That veth pair is just a point-to-point link. It's no different from a T1 link connecting a remote office, except in this case you're not routing it, you're bridging it, so this is layer 2. In the kernel you collect all of these virtual Ethernet interfaces that go out to your various containers and you put them in a bridge, which is just a layer-2 switch. It's as if you took all of those remote offices and put them on one switch, so you have one layer-2 domain and frames get sent everywhere. There's some kernel trickery that comes into this, but this is basically what you have.

The important point is that all of the IP addresses allocated to containers come out of the same IP block. Docker usually hands out something like 172.17.0.0/16, so every container gets an address on that one network. We can't use this to announce via BGP, mainly because we'd have to announce the entire network. We can't go down to the individual Docker containers, because there's only one interface here in the kernel, docker0, so you'd have to announce either that you have all of the containers or none of them. And if one or two or three of your containers aren't reachable, you'd black-hole traffic: you'd be announcing that you have them even though you don't.
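For reference, the bridge-mode plumbing looks roughly like this. This is a hedged sketch of what Docker's bridge driver does under the hood, using the github.com/vishvananda/netlink library (assuming a reasonably recent version of it); the interface names and the namespace file descriptor are hypothetical, and it needs root to run.

```go
package driver

import "github.com/vishvananda/netlink"

// plumbBridged creates a veth pair, hands one end to the container's
// namespace, and attaches the other to the bridge (docker0), the
// layer-2 "switch" that every container veth plugs into. The container
// IP (from the shared 172.17.0.0/16 block) is assigned separately by
// Docker's IPAM.
func plumbBridged(bridgeName, hostVeth, contVeth string, containerNsFd int) error {
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: hostVeth},
		PeerName:  contVeth,
	}
	if err := netlink.LinkAdd(veth); err != nil {
		return err
	}
	peer, err := netlink.LinkByName(contVeth)
	if err != nil {
		return err
	}
	// One end goes into the container's network namespace...
	if err := netlink.LinkSetNsFd(peer, containerNsFd); err != nil {
		return err
	}
	// ...and the other is enslaved to the bridge on the host.
	br, err := netlink.LinkByName(bridgeName)
	if err != nil {
		return err
	}
	if err := netlink.LinkSetMaster(veth, br); err != nil {
		return err
	}
	return netlink.LinkSetUp(veth)
}
```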
So what we want instead, for this anycast scheme, is what I just showed you, except routed. We still give every container its own namespace, we still make a veth pair and give one end to the container and keep one, but we want to route it, not bridge it. The reason is that the routes become host routes: instead of having one bridged network, we have these /32s, which are just routes to individual hosts, and those routes go in the kernel table on the host. Whenever you have routes in a kernel table, you can redistribute them via the BGP daemon into the rest of your network. So as containers come and go, these interfaces come up and go down, the host routes appear and disappear, and thus they're announced to and withdrawn from the network.

This is what we want to have. It allows you to do IP-per-container, so if you have some IPAM scheme where you want to assign each container its own address, you can do that. But in our case we want to assign the same IP address to a group of containers. We want to take Marathon, deploy N copies of an application, and have each of those instances come up with the exact same IP address in different parts of the network. Then that IP address gets announced into the network, so that all of the routers see multiple paths to what they believe is the same destination, when in reality it's multiple containers. That's what we're trying to achieve.
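The announce-and-withdraw mechanism is really just "react to the kernel routing table." In practice you'd let your BGP daemon (BIRD, FRR, whatever you run) redistribute kernel routes itself, but here's a small Go sketch, using github.com/vishvananda/netlink, that makes the mechanism visible by watching /32 routes come and go:

```go
package main

import (
	"fmt"
	"log"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	updates := make(chan netlink.RouteUpdate)
	done := make(chan struct{})
	defer close(done)
	if err := netlink.RouteSubscribe(updates, done); err != nil {
		log.Fatal(err)
	}
	for u := range updates {
		// Only IPv4 host routes (/32) matter for this scheme.
		if u.Dst == nil {
			continue
		}
		if ones, bits := u.Dst.Mask.Size(); ones != 32 || bits != 32 {
			continue
		}
		switch u.Type {
		case unix.RTM_NEWROUTE:
			fmt.Println("announce", u.Dst) // a container came up
		case unix.RTM_DELROUTE:
			fmt.Println("withdraw", u.Dst) // a container went away
		}
	}
}
```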
So the question is how we do this, and that's sort of the heart of this. We have this idea of plugins, with both Docker and with Mesos. If you're not using the Docker containerizer, if you're instead using the Mesos universal containerizer, this works in a very similar way. On a Mesos or DC/OS platform you have your choice of two different containerizers, and in both cases you can write plugins for the containerizer to customize the functionality you want. They work a little differently, but the end result is more or less the same.

In the Docker scheme, they have what's called the Container Network Model (CNM), and this is basically farmed out to a thing called libnetwork. In this case you have a plugin driver that runs in a container on the host, and its job is to listen to API calls from Docker: "I need an address," "create a new network," "release this address," and so on. Docker makes these various calls, and your custom plugin actually does the work. This Docker plugin is just an HTTP service; it's no more complicated than that. Your plugin runs as a container on the Docker host, and Docker talks to it over HTTP, generally via a Unix socket and not a network socket, because you don't generally want to make this available to the entire network.

The Container Network Interface (CNI), which is what Kubernetes uses and what the universal containerizer uses, is a similar scheme, except the plugin doesn't run as an HTTP service; it's an executable. The containerizer in this case just execs the executable and passes it arguments, both via standard input and via environment variables. But it's the same sort of thing: it makes requests of the plugin, "I have a thing, create me a network, assign an IP address," et cetera.

The other difference is that there are really two parts to this: there's the network driver and there's the IPAM driver. I'm going to gloss over the IPAM driver, but it's the part that actually allocates the IP address and so forth. If you have a custom IPAM driver, it's sort of assumed that you have a larger IPAM implementation in your organization; organizations like to centralize IP management for obvious reasons, and the IPAM driver is how you get plugged into the rest of your organization. For these purposes I'm going to assume that when you spin up a container you just tell it which IP address to use, which sort of bypasses the IPAM driver. But you need to know that the two plugin systems deal with this a little differently: with the Docker plugin there are two separate drivers, the network driver and the IPAM driver, while with CNI the containerizer only talks to the network part, and the network part is expected to talk to the IPAM driver on its behalf. The functionality is the same; it just works a little differently in how the calls are made.
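To ground that, here's a minimal skeleton of the CNM side: an HTTP service on a Unix socket that answers Docker's plugin handshake and stubs out the network-driver calls. The endpoint paths and content type follow libnetwork's remote-driver protocol; the socket path and the plugin name "routed" are hypothetical, and a real driver would do its plumbing inside these handlers rather than returning empty responses.

```go
package main

import (
	"encoding/json"
	"log"
	"net"
	"net/http"
)

func respond(w http.ResponseWriter, v interface{}) {
	w.Header().Set("Content-Type", "application/vnd.docker.plugins.v1+json")
	json.NewEncoder(w).Encode(v)
}

func main() {
	mux := http.NewServeMux()

	// Handshake: tell Docker which plugin APIs this service implements.
	mux.HandleFunc("/Plugin.Activate", func(w http.ResponseWriter, r *http.Request) {
		respond(w, map[string][]string{"Implements": {"NetworkDriver"}})
	})
	mux.HandleFunc("/NetworkDriver.GetCapabilities", func(w http.ResponseWriter, r *http.Request) {
		respond(w, map[string]string{"Scope": "local"})
	})

	// Docker calls these as networks and endpoints come and go; a routed
	// driver would create veth pairs and host routes in here.
	for _, ep := range []string{
		"/NetworkDriver.CreateNetwork",
		"/NetworkDriver.CreateEndpoint",
		"/NetworkDriver.Join",
		"/NetworkDriver.Leave",
		"/NetworkDriver.DeleteEndpoint",
		"/NetworkDriver.DeleteNetwork",
	} {
		mux.HandleFunc(ep, func(w http.ResponseWriter, r *http.Request) {
			respond(w, map[string]string{}) // accept and do nothing
		})
	}

	// Docker discovers the plugin via this Unix socket, not a TCP port.
	l, err := net.Listen("unix", "/run/docker/plugins/routed.sock")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.Serve(l, mux))
}
```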
So ultimately what we want the plugin to build is a routed network, and this is where we start. Usually the plugin is given a container, and the container is just a bare-bones running process that has its own namespace. What I'm trying to show you here is that the dark blue is the namespace on the host and the light blue is the namespace in the container. The steps you generally follow in the driver are these. The first step is to create the virtual Ethernet pair: one end is going to go in the container, the other end is going to stay on the host, and you assign a dummy MAC address pair to the veth pair. Then you take the container's IP address and assign it to the interface in the container. On the host side we use a 169.254.x.x address; this is a link-local address, and it gives us a way to do a route to an interface, which I'll cover in a minute. The host side is always going to be that exact same address.

Then we make routes on both sides. On the host side we take the host route of the container, the /32, and add it to the routing table, so that the host knows how to route to the container side. In the container we need two routes: first an interface route, where we take the link-local address and say this address can be reached by putting the packet on this link, and then a default route pointed at that link-local address, and that gets us out of the container. In order for this scheme to work you have to have proxy ARP enabled on the host, at least for the interfaces that are coming up; it's not usually an issue to do that, but you do have to enable it.

At the end of all this, here's what we have: containers coming up, each container with an IP address, the route to each container contained in the host routing table, and a BGP daemon running on the host that redistributes these addresses out to the rest of the network.
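Here's what those steps might look like in code, roughly following the sequence above. This is a hedged sketch using the github.com/vishvananda/netlink and github.com/vishvananda/netns libraries; the interface names, the 169.254.0.1 host-side address, and the function itself are illustrative, not the actual plugin mentioned later.

```go
package driver

import (
	"fmt"
	"net"
	"os"
	"runtime"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// The host side of every veth pair gets this same link-local address.
var hostLinkLocal = net.ParseIP("169.254.0.1")

func host32(ip net.IP) *net.IPNet {
	return &net.IPNet{IP: ip, Mask: net.CIDRMask(32, 32)}
}

// plumbRouted wires a container for routed (not bridged) networking.
// A real driver would also set the dummy MAC address pair here.
func plumbRouted(hostVeth, contVeth string, containerIP net.IP, ns netns.NsHandle) error {
	// 1. Create the veth pair; one end moves into the container.
	veth := &netlink.Veth{LinkAttrs: netlink.LinkAttrs{Name: hostVeth}, PeerName: contVeth}
	if err := netlink.LinkAdd(veth); err != nil {
		return err
	}
	peer, err := netlink.LinkByName(contVeth)
	if err != nil {
		return err
	}
	if err := netlink.LinkSetNsFd(peer, int(ns)); err != nil {
		return err
	}
	// 2. Host side: the fixed link-local address, bring it up, and a /32
	// host route so the kernel table (and the BGP daemon redistributing
	// it) knows where the container lives.
	if err := netlink.AddrAdd(veth, &netlink.Addr{IPNet: host32(hostLinkLocal)}); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(veth); err != nil {
		return err
	}
	hostRoute := &netlink.Route{LinkIndex: veth.Attrs().Index, Scope: netlink.SCOPE_LINK, Dst: host32(containerIP)}
	if err := netlink.RouteAdd(hostRoute); err != nil {
		return err
	}
	// 3. Proxy ARP on the host-side interface, as the scheme requires.
	f := fmt.Sprintf("/proc/sys/net/ipv4/conf/%s/proxy_arp", hostVeth)
	if err := os.WriteFile(f, []byte("1"), 0o644); err != nil {
		return err
	}
	// 4. Container side: enter its namespace, assign the container IP,
	// add an interface route to the host's link-local, then default via it.
	runtime.LockOSThread() // namespace changes are per OS thread
	defer runtime.UnlockOSThread()
	orig, err := netns.Get()
	if err != nil {
		return err
	}
	defer netns.Set(orig)
	if err := netns.Set(ns); err != nil {
		return err
	}
	link, err := netlink.LinkByName(contVeth)
	if err != nil {
		return err
	}
	if err := netlink.AddrAdd(link, &netlink.Addr{IPNet: host32(containerIP)}); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(link); err != nil {
		return err
	}
	llRoute := &netlink.Route{LinkIndex: link.Attrs().Index, Scope: netlink.SCOPE_LINK, Dst: host32(hostLinkLocal)}
	if err := netlink.RouteAdd(llRoute); err != nil {
		return err
	}
	return netlink.RouteAdd(&netlink.Route{Gw: hostLinkLocal}) // default route
}
```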
What I'm showing you here with the 192 addresses is the address space your hosts are numbered in, the address space of your rack. Each host gets a 192.168-something address, but inside the host the individual containers have a different IP address, reachable via a host route. The way you tell the rest of the network how to route to your containers is to peer the host's BGP with the rest of your network, so it can tell the network it has these routes. If you have this scheme, it will all just sort of work.

So what are the use cases? Why would you ever want to do this? One reason is UDP endpoints. Say you're running a syslog service: you just want to be able to send your logs somewhere, and you have a few syslog services scattered around your network, or you have multiple data centers and run a logging service in each one. You want to give your hosts one IP address to log to, so you configure your log service with an anycast address: all of your log collectors have the exact same IP address, and each host uses whichever one is topologically closer. A host in a given data center will tend to use the log aggregator in that same data center, but if you somehow lose that one, you'll still hear the routes from the other data centers, and the logs go there instead. It also works for things like DNS, as I mentioned, and it works really well for sFlow and NetFlow, which are all UDP-based services.

But you can also use it for TCP services; there's no restriction here. What you can do, which is kind of interesting, is deploy HAProxy this way. Say you have a web service. You deploy it, and you use Minuteman, if you're using DC/OS, to give it an IP address that's normally only reachable inside your cluster: Minuteman assigns some sort of dummy address, and you get a host-based VIP for your service, so you can run three copies of your HTTP service behind a Minuteman VIP. As long as you're in the cluster, that works great. The problem comes when you're outside the cluster. The usual solution for this is marathon-lb or something similar, where you run HAProxy, and that relies on this kooky scheme of reloading HAProxy every time something changes. HAProxy was never designed to do that, and they've gotten better about it, but I'm still kind of wary of doing stuff like that.

A better approach, I think, is to just deploy an HAProxy container and assign the front end, the server side of HAProxy, an anycast address, with the back side pointed at that named Minuteman VIP. The back end is static, and that config never needs to change. You deploy three copies of this with anycast, and your HAProxy becomes globally reachable from anywhere. If your containers move around, Minuteman takes care of updating the actual container instances, but the HAProxy instance never needs to change. You could do this with Traefik or any kind of load balancer that follows that pattern; you can deploy it with anycast.

Now, the caveats. One comes from what happens when you lose one of those equal-cost multi-path routes: the routers do have to rebalance the flows that were on the route you lost, and that causes connections to be reset, though just for the flows that were using that one route. The other is that if you deploy HAProxy on an anycast address, you lose the ability to gracefully take an HAProxy instance out for maintenance. Taking the instance down resets all of the connections on it, and they're forced to reconnect to another HAProxy instance. As long as you've got applications that are willing to tolerate that, it works well. If you've got stuff that won't tolerate it, or you're sending large files via POST or something where resetting connections is problematic, you don't want to do this, because you'll run into issues.

So that's the high level; I glossed over a lot of the details. I'll put this on the slide deck when I upload it: there's a repo from Adalia on GitHub that implements the Docker routed plugin, and it works out of the box. The plugin is called the CNM routed plugin. You just run it on your host and use it as your Docker network driver, and it divvies out these host routes. So if you have a setup like this in your data center, where you're running BGP or something else, and you're set up so your hosts can peer with the rest of your network, it's really easy to do this; you just need to run the plugin, and suddenly your containers are reachable from everywhere in the network. And you wouldn't necessarily even have to use anycast: like I mentioned, if you have an IP-per-container scheme, you can do this too, and it makes your containers' IP addresses reachable from everywhere as well.
So that's kind of the gist of the talk. I wanted to open it up for questions, because I realize I went through this fast.

On the question of multiple copies of the same address: if you've got four of the same IP address in one rack and one somewhere else, since the hosts are doing the announcement, the top of rack gets all four of those host routes and announces them, so everywhere in the network you're just going to see five routes, and four of them happen to go to that top of rack. Are you thinking about aggregation? It will aggregate if you tell it to. Normally, with garden-variety networking, you want to aggregate that stuff: you don't want all these really specific routes leaked out everywhere, because you end up with gigantic routing tables. So typically you'd have something on the router like an aggregate, a supernet, where the router automatically takes the more-specific routes and hides them behind that single aggregate, and as long as one or more of those more-specifics is reachable, the aggregate stays announced. But you don't have to do that; you can specifically say, hey, if you've got anything in this net block, go ahead and announce the /32s into the network.

So the question was, if you're doing IP-per-container and announcing all these /32s, what's the limit? You don't want 60,000 containers in a data center all announcing host routes into your network. It's sort of understood that you're going to have a somewhat smaller implementation than that, but generally speaking you're going to run out of IP addresses before you run into that kind of limit. Even if you're using private addresses, just do the math: if you're Facebook, doing IP-per-container is going to be very, very difficult, because even with private address space it gets very hard to divvy that stuff out.

One more question, maybe two. In the implementation I showed you, yes. You could always get around that with, I don't know, netfilter trickery, but generally the kernel is going to make that decision: if the kernel has that route and it's local, it'll use the local one. I don't know of an instance where you'd want it to go out of the rack; can you think of a use case? Oh, right: if you have a bunch of stuff in that rack that needs to talk to that one thing, you'd have that issue, and you'd have to put multiple instances in that rack to share the load. It gets really hard to do this kind of stuff granularly with this scheme, because it's not really meant for load balancing in that sense: it will load-share, not load-balance, if that's a thing.

Is there one more question? Yeah, so it depends on how your IPAM works. You can have your IPAM do your DNS for you, so if you wanted to, you can name these things: you'd have an IP address with a name, and in your service configs you'd inject the hostname, so you'd use DNS as your service discovery. Or you can use Mesos-DNS or Consul or whatever. The IPAM driver that I glossed over is where all of that happens; that's how you get this into the larger discovery picture, generally.

And we're out of time. Thank you very much for attending. I know it was kind of a very dense talk, but I hope it was useful.