Hey friends, thanks for joining me for my talk. My name is Leigh Capili. I live in Colorado, I enjoy parkour, my dog's name is Pepsi, and I'm a Filipino-American. I've stayed pretty involved with SIG Cluster Lifecycle in the Kubernetes development groups. If I know you from there, thanks so much for coming to support the talk and learn with me. If I don't know you yet, please come and say hi: my DMs are open on Twitter, or you can find me on the Kubernetes Slack.

I work on the developer experience team at Weaveworks, and I'm very lucky to be friends and teammates with such wonderful people. We're the primary contributors to the Flux CD project, but Flux CD has an open governance model, so please come and build the best GitOps tooling available with us. We're helping people adopt Flux v2 lately: a new set of APIs and CRD-driven controllers that do a modular split of fetching, syncing, and applying to the cluster. If you want to learn more about implementing GitOps from a cultural perspective with your teams and your organization, come check out our GitOps community website.

Today I'm happy to be discussing some strategies for multi-cluster routing and networking. Before we just get into routing packets, we want to talk a little bit about the rationale for multiple clusters. One of them might be workload proximity. Say you're a breakdancing apparel brand and you have huge user bases in Korea and France. You might want to run some services for your web store in those regions to reduce latency and improve reliability, and maybe there are services specific to Korea that aren't relevant in France. Similarly, if you're in that situation, you might want to separate your failure domains, so that if your cluster is failing in France, you can still serve those special services to your users and your fans in Korea.

Another thing we see often with Kubernetes is a desire to split up compute so that each application is segregated to particular
nodes. You can do this with the Kubernetes API, but it's kind of complicated: you have to use a combination of RBAC, namespaces, NetworkPolicy, and the PodNodeSelector admission controller to decide that different namespaces can only schedule to subsets of labeled nodes. So if you've got certain nodes labeled low-latency, that could be, say, the entire cluster or half of it; a portion of that could be ephemeral, and others could be a node pool for persistent storage. You can slice this up in any number of ways using multiple namespaces and the admission controller. Super cool, very flexible. It's more flexible than the alternative, which is just to split into two different control planes in separate clusters and give teams full access to each. But one is arguably easier than the other, so this is the common reason people will split their compute: it makes things really simple, and often those control planes are free.

One area where you start to hit not just complexity but actual constraints of the Kubernetes API is non-namespaceable objects. When you modify the API server with a CustomResourceDefinition, that's not something that can easily be namespaced. Multiple people are potentially going to be stepping on each other if you allow them all to write CRDs, and in addition, some controller implementations might be using an API that doesn't tenantize in a very secure way. So you might be inspired to separate your cluster boundaries.

Beyond technical reasons, maybe you just have social, billing, reporting, or organizational reasons. One team might have more resources than another, and one team might be incredibly disenfranchised because of the latest reorganization. Some other miscellaneous reasons I can think of: maybe you're trying to use some novel features from a Kubernetes service provider.
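Going back to the compute-slicing idea for a second, the namespace-to-node pinning can be sketched with the PodNodeSelector admission controller. This is a minimal illustration; the namespace name and node label are made up:

```yaml
# Enabled on the API server with --enable-admission-plugins=PodNodeSelector.
# Pods created in this namespace can only schedule to matching nodes.
apiVersion: v1
kind: Namespace
metadata:
  name: team-fast            # hypothetical namespace
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: "node-class=low-latency"
```

Combined with RBAC scoped to the namespace, this keeps a team's workloads on its slice of nodes without a separate control plane.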
Say one KaaS provider does edge really well and gives you secure enclaves. There are also hybrid cloud environments, or you might be doing a migration in a hybrid cloud environment and need multiple clusters for that; I've been in that situation before. There's business-to-business networking: say you have a business relationship with another company that has a cluster with an object store in it. There's some valuable stuff in there that you have contractual access to. How do you open up the network paths to access that object store over the network from your cluster? And then there are mergers and acquisitions, which naturally produce the need for networking between multiple distributed computers like Kubernetes clusters.

So you end up in this world of lots of little and big clusters. Some of them are shared by multiple people; others are running a single application a thousand times in a bunch of retail stores or on trains. You inevitably need to be routing some packets between some of these clusters.

The problem is that for the basic unit of compute inside of Kubernetes, you get an individual IP address for each one of those pods. So we have this Service abstraction that allows us to label-select those pods and keep an endpoint list up to date. Then other things can just watch the endpoint list: things like nodes, cloud provider load balancers, and ingress controllers can stay up to date with what the backends are. That way, when pods become unavailable, they get removed from the endpoint list, and when you get a new pod with a brand-new IP address, assuming it becomes ready, it gets added into the endpoint list.

But the problem is that those aren't the only IP addresses in the cluster. It's not just the pod IP addresses that matter: Services have virtual IPs. Nodes, of which you might have a couple hundred, might each have a bunch of NICs. There might be multiple node pools with different NIC
configurations. The cloud provider might be producing IP addresses for multiple load balancers, some of them public on the internet and some of them IP addresses in the VPC. Your ingress controller may be fronted by an IP forwarding rule, or some server inside your cloud provider's control plane. Or say you're using a bare-metal solution that provides an ingress controller IP from a NodePort-powered node; then the IP addresses are completely different. It becomes a little bit difficult to track all of this IP information. You're still declaring it, but it gets dynamically created by whatever the state of the environment is, or however it was provisioned. This is a lot of network churn to track. Oh, sorry, I forgot about port maps: between each of the IP address paths here, there's also the potential for a bunch of port mapping to occur, so there's even more information you're going to have to deal with.

So the Kubernetes API lets you encapsulate all of this network drift, but then you have multiple clusters. How do you resolve the drift between two clusters, or between a cluster and your data center, where you want your legacy apps to be able to access the services you're now hosting? You have this really modern, fast service discovery infrastructure inside the cluster, but none of the rest of your older infrastructure can keep up. And even between multiple clusters you have the same problem: how do you get things to talk to each other?
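As a baseline before getting into the solutions, here's the Service shape that all of these mechanisms build on: a label selector driving the endpoint list that nodes, load balancers, and ingress controllers watch. The names here are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                  # hypothetical service
spec:
  selector:
    app: web                 # ready pods matching this label become endpoints
  ports:
    - port: 80               # the Service VIP listens here...
      targetPort: 8080       # ...and forwards to this pod port (a port map)
```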
One kind of beginning solution, the first thing people are usually introduced to, is to use something like Service type NodePort or Service type LoadBalancer. This is a layer 4 abstraction that allows us to get a single IP address, or IP and port combo, that maps, forwards, or proxies traffic to Service virtual IPs on each of the edge routers of the Kubernetes cluster: the nodes. So you basically get a mapping from the Service VIP to either node IPs or something provisioned by your cloud provider or other toolset. And with that IP address, if you're looking for a more declarative configuration, you can use controllers like external-dns or cert-manager to get stable naming and TLS identities to mount within the workloads for those pods.

Here we can see: if we spin up a load balancer (that's the yellow thing inside the bottom cluster), we're able to reach out to it from those green backend pods, as well as from the legacy workloads, or whatever we're running inside a data center or other infrastructure.

Now, reaching beyond just a single service: load balancers and NodePorts give us a single-service abstraction, but ingress controllers allow us to route to many services. They do this because an ingress is a reverse proxy; the Ingress API describes how a reverse proxy should behave. So you have these ingress controllers implementing the Ingress API, and we can route one network identity to many service backends based on the content of the protocol the client is speaking. This allows us to do one-to-one and one-to-many setups.
I mean that in terms of namespaces and Ingress objects. Sometimes you can have a single ingress network identity hooked up to a single Ingress object in one namespace, routing to multiple services in that namespace. Other times you can collect multiple Ingress objects from across all of the namespaces in the cluster and link them up to the same ingress controller. It's primarily a layer 7 abstraction, but some ingress controllers let you do layer 4 stuff at the same time. And it's really important to note that these one-to-many setups allow you to really reduce the external network churn that happens, in terms of IPs, outside of your cluster.

With cloud provider ingresses, you usually get to expose some kind of internet-facing or VPC-accessible IP address, and it's usually per Ingress. So you can still route to multiple services within the same namespace, but this tends to be a constraint of their implementation: you can't just route to everything inside your cluster. A pretty good design decision, but also a serious constraint that might not fit your use case. These things are often very powerful because they're usually backed by some kind of horizontally scaled cloud provider infrastructure, but that can also make them incredibly expensive, which can make them a prohibitive solution if you're just trying to do low-volume inter-cluster traffic or something like that.

So if you find that's not a good solution for you, you might look at hosting your own ingress controller inside your cluster. There are a lot of third-party solutions for this, like Contour, Traefik, etc.
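Whichever controller you pick, the Ingress objects it aggregates look the same. Here's a one-to-many example of the kind described above; the hostnames, service names, and issuer are hypothetical, and the TLS section shows certificates decoupled from the workloads:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: storefront
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # hypothetical issuer name
spec:
  ingressClassName: nginx    # aggregated by a self-hosted controller
  tls:
    - hosts: ["shop.example.com"]
      secretName: shop-tls   # cert-manager maintains this secret
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: web, port: { number: 80 } }
          - path: /api
            pathType: Prefix
            backend:
              service: { name: api, port: { number: 8080 } }
```

One network identity, two backends, and rotating the cert never touches the apps.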
NGINX Ingress is also super popular. If you deploy these, you often get a one-to-many setup: if you have a single ingress controller, all of the Ingress objects can set that ingress class and get aggregated to that single reverse proxy. This composes really well behind something like the layer 4 load balancer solution, or a NodePort, and you can start meshing between these ingresses and doing some pretty interesting network topologies.

For declarative configuration, we use the same external-dns and cert-manager to get stable DNS. And cert-manager actually gives us a bonus win here, because the Ingress API supports TLS termination, which allows us to decouple TLS certificates from the application workloads. You no longer have to restart your apps if you want to change your domain name, add one, or even just rotate the cert. For automating the TLS certs, it can really help to use wildcard DNS plus wildcard TLS certificates. This lowers your deploy latency, because you don't have to constantly be requesting new TLS certificates and creating new DNS records. If you have a DNS zone that's API-enabled, this is ideal, because it lets you create the DNS record, and then you can also do something like ACME v2 with Let's Encrypt to refresh that wildcard cert.

Here we can see an ingress with an externally accessible, internet-accessible IP that's just provided by the cloud provider. As a natural consequence of being on the internet, suddenly anything with access to the internet can talk to it: pods from inside a cluster, workloads inside a data center. As long as they have some route to the public internet, they can get into the other cluster.

Here we have a different problem: the ingress controller in the lower cluster is a self-hosted one.
So it doesn't have an external IP address. We front it with a load balancer, in yellow, and the load balancer provides the network identity that the pods and workloads inside the data center can route traffic to.

You can see that if we start combining ingresses in one cluster with ingresses in another cluster, you can get these one-to-one and one-to-many abstractions between many different locations in your network topology. You get this kind of mesh behavior starting to form, where you have a lot of access to the things you've provisioned. Super cool. Now, one constraint I want you to notice here is that there isn't any ingress into the data center, because for that you would need to, say, open a load balancer on your data center side; that's more traditional operations.

This is also an area where you could start thinking about a route sharing solution. With route sharing, one of the primary goals is to make the pod IP addresses, which are constantly changing inside the Kubernetes cluster, natively routable: not just on the nodes, the routers of the cluster, but beyond the nodes, beyond the cluster perimeter. One use case to think about here would be if you wanted to run an ingress controller outside the cluster, actually running the ingress controller infrastructure outside the cluster. Then that ingress controller infra would need to be able to route to the pod IP addresses that it reads from the API server's Endpoints. So that's one use case. Another variation is if you wanted to run that ingress controller inside another cluster: same thing, just a little more cheeky and clever.

Now, it's not just about pod IPs. We also might want to be able to route something more useful, like a Service virtual IP. This is kind of tricky, because the virtual IP doesn't exist in any one place.
It exists in the entire cluster. But it's possible to use the Endpoints API to determine which nodes a Service has ready pods on. If you can figure out which nodes those pods are running on for that Service, then you can take all of those node IPs and advertise them as available routes for the Service VIPs, and you can even weight them.

If you're putting equivalent routes into a routing table, then it's important to know about the concept of ECMP: equal-cost multi-path routing. This is supported by the Linux kernel, and it's supported by a lot of standard networking hardware. If you have your own switches in your data center or co-location, you might take a look at the manual and see what the ECMP behavior of that switch's routing tables, or your other networking equipment, actually is. For instance, I used to work with a switch, inside a Kubernetes environment, where you could advertise an arbitrary number of equivalent routes, more or less. That routing table could get super long, but only eight of those equivalent routes would be hashed and used at any one time. The benefit of that is, say one of those routes actually became unavailable because we were rotating a node out of the infrastructure: it would very quickly drop out of the routing table as a healthy path, and then there would be load balancing and failover to the other nodes because of ECMP.

The protocols used to accomplish this kind of route sharing between things like routers are typically BGP and OSPF. These kinds of protocols run the internet, but they're not that scary. OSPF is popular within private networks because it's only for single autonomous systems, whereas BGP is an inter-autonomous-system protocol, so it can deal with multiple autonomous systems sharing routes. You can mix these protocols, so that one part of your network uses OSPF and another part uses BGP, and the routes will transfer properly.

So here in the route sharing diagram, we need the routers of our Kubernetes cluster as well as the
routers of our data center to talk. In the Kubernetes clusters, just reiterating, the nodes are the routers of the cluster. The nodes are what allow you to figure out where the virtual NICs of pods are, and they're also programmed by things like kube-proxy, or your kube-proxy replacement, to have special iptables or IPVS rules that do the Service VIP load balancing. So, very cool: those nodes can talk between each other and share routes for pod IP addresses, Service IPs, and the nodes that back them, and they can also share that same information with the traditional infrastructure running inside your data centers, or whatever network fabric you have.

Now, for route sharing, it's important to note that each cluster needs to have unique service subnets and pod subnets. Basically, you can't have collisions, because an IP address can't live in multiple places; you want a unique route to be able to get there. So each cluster has to have these separate subnets, and that can be a tricky challenge sometimes, especially if you're traditionally deploying Kubernetes with overlay networks. Sometimes people take the shortcut of using a default subnet and just having it be the same everywhere. So this is something you'll want to think about; it's definitely a constraint, or requirement, of the solution.

For technologies you might want to look at (an incredibly short, non-exhaustive list): Calico is super popular, and it's built on an older project called BIRD. Kube-router is a very elegant single Go program that does network policy, CNI, and a kube-proxy replacement all in one. Go check it out.
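The ECMP behavior described a moment ago, a capped set of hashed equal-cost routes with flow affinity and fast failover, can be sketched in a few lines of Python. The hashing scheme here is illustrative, not what any particular switch actually does:

```python
import hashlib

def ecmp_next_hop(flow_key: str, routes: list[str], max_paths: int = 8) -> str:
    """Pick a next hop for a flow from a set of equal-cost routes.

    Only the first `max_paths` routes are hashed over (hardware often
    caps this, e.g. at 8), and a given flow keeps mapping to the same
    next hop while the route set is stable.
    """
    active = routes[:max_paths]
    digest = int(hashlib.sha256(flow_key.encode()).hexdigest(), 16)
    return active[digest % len(active)]

# Twelve advertised node routes, but only eight are in play at once.
routes = [f"10.0.0.{i}" for i in range(1, 13)]
flow = "10.1.2.3:31337->10.101.0.10:9898"
hop = ecmp_next_hop(flow, routes)
assert hop == ecmp_next_hop(flow, routes)  # flow affinity

# Rotate that node out: its route disappears and the flow fails over.
remaining = [r for r in routes if r != hop]
assert ecmp_next_hop(flow, remaining) != hop
```

Real ECMP hashes the packet 5-tuple in the kernel or the switch ASIC; the point is just that withdrawing a route immediately redirects its flows onto the remaining healthy paths.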
Very, very cool. I've deployed both Calico and kube-router, but I haven't deployed Romana, which is another BGP-based CNI; you should go check that out. For OSPF, go check out the FRRouting project, which forked from Quagga. You'll also just want to read the manual for your own router to see what kinds of protocols it supports. Go take a look and mess with this stuff; it's really approachable. It's not that scary, as long as you're not at internet scale and accidentally taking down somebody's network.

Routes are great when we're just working with IPs, but as people, we want to configure things using DNS, right? We want names for services. When we talk about Service VIPs, we don't typically think about the actual IP address of that thing; we think about resolving it using the service name, the namespace, and then the service cluster-local domain. This is typically only accessible from inside a Kubernetes cluster. The service that allows you to do that is called CoreDNS, and it's very extensible and configurable. One thing to know about CoreDNS: its service VIP is always the tenth IP address of the service subnet, not just some random IP inside the cluster. It's the tenth one. And if your service subnets are unique and you are
It's the 10th one and um, if your service subnets are unique and you are Sharing the routes for services across your clusters Then you should be able to talk to the dns server of each cluster no matter where you are inside of your route sharing infrastructure So that would let you do something like dns forwarding Where say each cluster knows which zone it has and it's also knows what all of the zones of the other clusters dns servers are So say i'm in cluster one and I get a request for cluster zero I can send that to cluster zero and get the answer back Um in sort of this kind of dns mesh You could also just use an upstream dns server topology if you want to collect them to a central place or advertise those things on the internet Another common thing might be if you have a private dns server or a split horizon setup on your vpn Then you could See how that would enable you to use your route sharing infrastructure plus dns forwarding To access services by name from your laptop or from workloads inside of your data center This is a super and fun environment to be in. I really miss being able to have access to this kind of infrastructure and so Because I missed that I built a demo I would love for anybody to be able to play with this kind of infrastructure. It's typically expensive and kind of a niche opportunity It's hard to get an environment where you can play with this kind of thing without having some sort of deadline or cost factor, so Why don't we get into an environment that lets us play, right? 
You can clone my repository at stealthybox/multi-cluster-gitops (stealthybox is my GitHub handle), fork it, and then just run the kind setup and kind load scripts. It'll take a few minutes depending on your computer and what you already have downloaded. This will get you three kind clusters without any CNI set up at all, and they'll each have configuration for different subnets for pods and services. You can see here the default CNI is disabled, because we're going to use Calico, and the load-image script just preloads everything for you.

Once you have this infrastructure, go ahead and just follow the README. We'll want to do a kubectl apply to kind cluster zero, where we need to bootstrap CNI. So let's just expand the kustomization directory for the kube-system namespace and do that real quick. This needs to bootstrap Calico so that we can actually do the flux bootstrap, which is the next thing I'm going to suggest we do. In order to do that, I have to export my GitHub token. This step is just slightly different for me because I'm using a hub token here, but you can just put your own in. My hub token is modified to also have access to repos and SSH keys, which are two of the permissions you'll need; it's mentioned in the README. So I'll do a flux bootstrap with my user, stealthybox, and a personal repository called multi-cluster-gitops, and I want to sync the path config/cluster-0 and apply it to the cluster. If you look inside config, I have a folder called cluster-0 with a bunch of namespaces with a bunch of config in them. So we'll get Flux bootstrapped, ready to go, and throw that off to the side.

Here we can start talking a little bit about what kind of configuration we're loading. Inside the kube-system namespace, there is a Kustomization that loads the Calico, CoreDNS, and Serf deployments from libraries. Calico is just
set up to do the subnet advertisement stuff. That's done with this patch right here: we advertise the cluster IPs for the specific subnet of cluster zero, and then cluster one and cluster two. I'm disabling this here just because it could break setups depending on your computer.

Now, the Serf configuration: I want all of the nodes in the cluster to join a Serf cluster using multicast. You could also just use some well-known IP addresses, like the IPs of your control-plane nodes, since those will change less often than your workers will. Basically, it uses the Serf cluster to solve the problem of network drift in node IP addresses: each Serf node can know its own network identity and advertise it to the rest of the Serf cluster.

And then if you look inside the library Serf deployment, there's a query deployment with a beautiful Bash script I wrote that is actually a pretty well-functioning reconciler. It even has graceful exit. The function of this Bash script is to convert the Serf members into BGP peers, and then also template the Corefiles for the cluster so it can mesh with all of the other clusters. You can read through this code; it's idempotent. It was a pretty fun exercise: just jq, Bash, kubectl, and calicoctl stuff. It was really fun to write. You can also see how to do graceful exits in Bash; that was a fun one to figure out, with traps, jobs that are being waited on, and grabbing the job list. Anyway, that's just a funny sidebar. I'm sorry.
I apologize in advance for using that much Bash. But basically, the other secret hat trick that happens here is that cluster one and cluster two have a very similar setup in kube-system. In each of these clusters we also have a podinfo deployment and a debug deployment, so we can mess with the network. And in the flux-system apply directory, we have cluster-1 and cluster-2 applies, which use the new Flux toolkit Kustomization API's remote-access feature to basically apply, from this management cluster, all of the manifests in the Git repository to the other clusters. So, pretty cool stuff; go check that out.

With the single flux bootstrap done on our cluster, we can start examining some resources and see if some of our networking is working. I think the first thing I'll look at is just: do we have our BGP peers? Here we are in cluster zero, and we have BGP peers for cluster one and cluster two. Because these are all peering properly, that means the Serf cluster is built. Similarly, I could change context to look at cluster one, and you'd see that it's peering with cluster zero and cluster two. So that's a really good sign.

The other thing we'd check: I want to show you the CoreDNS deployment. It's slightly extended; you can read the patches here. In kube-system, describe the ConfigMaps prefixed with "core". This almost looks like a normal Corefile, except that in addition to the cluster-local domain for the Kubernetes services, we have extra kube zones templated in from the environment, and we also import any other Corefile snippets from /etc/coredns.d. We have separate ConfigMaps, then.
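Stepping back to the Serf-to-BGP reconciliation for a second, the core of what that Bash script computes can be sketched in Python. The member and peer shapes here are hypothetical, just to show the idea of turning a Serf membership list into BGP peers for the other clusters:

```python
def desired_peers(members, local_cluster):
    """Derive BGP peers from a Serf member list.

    Assumes each member advertises its cluster name and AS number as
    Serf tags (an illustrative convention, not the demo's exact schema).
    """
    peers = []
    for m in members:
        if m["status"] != "alive":
            continue  # only peer with healthy members
        if m["tags"]["cluster"] == local_cluster:
            continue  # in-cluster routes come from the CNI's own node mesh
        peers.append({
            "name": f'{m["tags"]["cluster"]}-{m["name"]}',
            "peerIP": m["addr"],
            "asNumber": m["tags"]["asn"],
        })
    # Sorting makes the output stable, which keeps the reconciler idempotent.
    return sorted(peers, key=lambda p: p["name"])

members = [
    {"name": "node-a", "addr": "172.18.0.2", "status": "alive",
     "tags": {"cluster": "cluster-0", "asn": 64512}},
    {"name": "node-b", "addr": "172.18.0.5", "status": "alive",
     "tags": {"cluster": "cluster-1", "asn": 64513}},
    {"name": "node-c", "addr": "172.18.0.9", "status": "failed",
     "tags": {"cluster": "cluster-2", "asn": 64514}},
]

# From cluster-0's point of view, only the healthy remote node is a peer.
assert [p["peerIP"] for p in desired_peers(members, "cluster-0")] == ["172.18.0.5"]
```

A real reconciler would then diff this desired set against the current peers and apply only the changes.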
Here's the CoreDNS config that mounts to that location. This is being created by the Serf query controller, the Bash script we just read. We can see an additional CoreDNS configuration, linked in from a separate ConfigMap, that takes cluster-1.lan and forwards it to the tenth address in the 10.101 service subnet, and forwards cluster-2.lan to the tenth address in the 10.102 service subnet. Hopefully I won't have to restart CoreDNS for this; I've been having a few problems with that every now and then. For the CoreDNS env, we can see that the extra kube zones setting is cluster-0.lan, so this CoreDNS server knows that it's in cluster zero and will respond not only to cluster.local but also to cluster-0.lan for any of the service requests we have.

So, if we get into our debug pod, ideally we should be able to check that CoreDNS is functioning. We can look up podinfo: we can get the podinfo service from cluster.local, and the same IP address is returned if we query cluster-0.lan. We can also get a completely different address for a completely different service in a different cluster by changing to cluster-1.lan, and we might also get lucky and be able to resolve a service for cluster-2.lan. And if all of our route sharing is working (we already know DNS forwarding is functioning), we should also be able to curl, from kind cluster zero, the podinfo service in the default namespace of cluster two, at port 9898. And there we have cross-cluster communication without the use of any NodePorts, ingresses, or load balancers: purely route sharing and DNS forwarding. Pretty cool stuff.

If you want to take a deeper look at what's happening under the hood, I would really like it if you'd check out the project on GitHub, fork it yourself, and see if you can replicate these results in your own environment. The best part about this is that it
costs basically nothing, as long as your laptop is strong enough to run a few kind nodes. Super fun; I had a ton of fun putting this demo together, and I would love to hear your feedback. That's really cool, to me at least; I hope you have some fun putting it together too.

To recap that example: we had no upstream DNS server or additional infra except for my laptop, which is not part of the routing mesh. We had three clusters doing DNS forwarding and route sharing between each other, with several domains, and then we had a podinfo deployment in each cluster and we were able to curl between them. Pretty cool stuff; go check that out.

So with route sharing and DNS forwarding, we can extend the native Kubernetes service discovery and routing mesh beyond the cluster. But you do incur another single hop in the service routing by doing services this way. As an alternative, you could have service controllers programming each of the nodes across the clusters and doing route sharing in a slightly different way; it's possible to eliminate that hop.

This is a layer 4 abstraction, so there's no built-in encryption happening as the packets transmit between the clusters; that's not a given. You did see me disable the IP-in-IP encapsulation.
That is a feature of Calico, if you have Calico on both ends. But outside of that, you're going to have to look into your implementation to figure out whether you need to encrypt your packets, say if they're traveling across an untrusted network or the open internet.

This route sharing, this layer 4 adjacency and routability of pods, is also typically a prerequisite for multi-cluster service meshes. So if you're interested in doing a multi-cluster mesh, you're going to have to know a little bit about route sharing and this kind of infrastructure setup.

Some technologies, again a non-exhaustive list, but these are some things I think are fun. From Weave we have Weave Net, one of the older CNIs that's been around for quite a long time. Weave Net emulates a layer 2 network, using MAC addresses and such; it's pretty neat. Weave Net allows multicast, so you could run something like the Serf cluster that I deployed on top of a Weave Net set of nodes, and it's pretty easy. In a previous talk, Luke Marsden demonstrated Weave Net being meshed across multiple Kubernetes clusters for pod-to-pod routing.

For something a little more interesting, you could, say, create a WireGuard network. Cilium allows you to use WireGuard, and they have some of the coolest, edgiest, non-standard multi-cluster mesh technology out there right now. Go ahead and check out Cilium.
They actually have a route sharing and network policy sharing implementation using a bunch of etcds, instead of something like BGP.

And then the other super cool category of network tech out there right now that you should look at: two-way UDP hole punching, NAT-traversing private networks. There will typically be some kind of public coordination or lighthouse server (to use the Tailscale and Nebula terms) that allows nodes inside private networks, behind NAT, to find each other over public networks. You can look at Tailscale, which uses WireGuard under the hood. Slack uses Nebula for their own corporate workforce reasons, and Nebula uses the Noise protocol to encrypt traffic; it's able to do this kind of double NAT traversal without any public machines actually routing traffic through them. ZeroTier is another solution as well, and its new version, like Nebula, supports multicast, but Tailscale does not.

As far as going further beyond, trying to defeat NAT and encrypt your traffic on untrusted networks while still doing fancy route sharing and DNS forwarding: we didn't talk much about how to do network policy between clusters. As far as I know, this is pretty unexplored space. I haven't played with the Cilium network policy stuff enough to know, but what I do know about Cilium is that it does tag network packets in a way where this could be possible across clusters.

For something more... well, I shouldn't say theoretical, because there's an implementation now: go look at SIG Multicluster's MCS API.
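For reference, the MCS API's core objects are small. Exporting a service looks roughly like this (the group/version is the sandbox one; the service name echoes the demo and is used illustratively):

```yaml
# Marks the service for export; a controller in the other clusters then
# materializes a ServiceImport plus endpoints for it.
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: podinfo
  namespace: default
```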
You can also read KEP-1645. This is the Multi-Cluster Services API: a specification for how clusters can share the endpoint lists of other clusters' services. So that's one way to get rid of that extra hop, because kube-proxy can now be MCS-API-aware.

For cross-cluster service meshes, you might look at things like Istio or Linkerd, or maybe assemble your own kind of thing for a particular use case. Also take a look at Consul Connect.

And for inter-cluster orchestration, a final tidbit of things to chew on: you saw that with Flux we were able to apply a single bootstrapped GitOps directory structure to multiple clusters, and we were using the Kustomization toolkit API to control those applies. You can actually create health checks and dependencies between individual Kustomizations within a set of toolkit controllers. And if you were to deploy Flagger in each of the clusters using those Kustomizations, then, once the kstatus implementation for Flagger is finished, you could do not just a cross-cluster canary (which is already possible, and people are already doing it with Istio multi-cluster mesh and Cilium), but actually orchestrate the dependencies between those Flagger canaries. So if you're interested in canaries, service meshes, and traffic shifting and shaping, beyond just the normal facilities for zero-downtime deployments and cross-cluster traffic with Kubernetes route sharing and DNS forwarding, check out Flagger, and check out dependency management with Kustomizations in Flux.

For today's demo, again, you can go to the GitHub repository, fork it, and try it on your own. Play with it on your machine; submit issues or ping me on Twitter if things aren't working. I really want to hear if you're learning something from this kind of environment, because it took me a long time to get to a place where I could learn these skills, and I would
really like for the next generation of practitioners, and people from different backgrounds, even if you're very senior, to learn about BGP and how to do route sharing in the context of Kubernetes and cloud native applications. Hit me up: I'm available, DMs open on Twitter, on my GitHub if you want to follow me there, and again on the Kubernetes Slack as well. I'd really like to get to know you and hear what you're doing with some of the technologies we build at Weave, or any cool thing you're doing with Kubernetes. Also, say hi in SIG Cluster Lifecycle, and thanks so much for coming to the talk. See you later!