Hello, everyone. Welcome to our talk. This is Laurent, and I'm Antoine. We're both engineers at Datadog, and today we're going to talk about networking and service discovery.

I'll start with a few words about Datadog for those who don't know us. We are an observability product with a large set of integrations. We already have quite a large customer base, and because of that we have a pretty big infrastructure to store and query all this data. Laurent is going to start by introducing our infrastructure a bit more, and that context will be useful for the rest of the talk.

Yes, before we dive into the purpose of the talk, which is service discovery, we're going to give you a quick history of our infrastructure, so you have the context and the constraints we operate under. Back in 2018, Datadog was running in a single region on AWS, and everything was managed with Chef and Capistrano. As we reached thousands of nodes, that started to become a challenge. In addition, we had new customers in Europe who wanted to send their data to Europe, and that led to one of our first big infrastructure projects: creating a new region on a new provider. Because we were seeing the limits of Chef, we started to deploy with Kubernetes instead of Chef in Europe, as you can see on this slide. We also started to do this in the initial region in the US, but only a small part of it was on Kubernetes at the time. Fast forward to 2024, and this is far more complex: we're running in six different regions on three different providers, everything is running in Kubernetes, and we run millions of containers. Here is what's important for this talk.
We run thousands to tens of thousands of nodes in each region, which means, because of scalability limits in Kubernetes, that we need multiple clusters per region; in some regions we have dozens. What really matters for what we're going to discuss today are these three things: we are 100% on Kubernetes, we run on three different providers, and regions can have up to dozens of clusters.

If we zoom in on a region: because we have many clusters, we need to assign workloads to clusters. In this very simple example we can see three different clusters, one dedicated to the metrics application, one dedicated to the logs application, and a third one dedicated to shared services such as Kafka and Cassandra. This looks simple enough, but for reliability and scalability reasons we tend to deploy stateless workloads in zonal clusters. You can see here that instead of having a single metrics cluster in the region, we actually have one in each availability zone. This gives us increased reliability, because if a cluster fails we still have two clusters to run the workloads, and it also aligns failure domains: the failure of a cloud provider zone tends to have a similar effect to losing a cluster. Of course, this mostly works for stateless services. If you have coordinated stateful services such as Kafka or Cassandra, it's much easier to run them in a regional cluster.
So we have a mix of zonal and regional clusters. Now that we have an idea of how we run things: because we have many applications, they need to talk to one another, and that means service discovery. Before diving into how we do things, we're going to give you a quick overview of how it works in Kubernetes in general.

When you start deploying and exposing an application in Kubernetes, the first thing you're probably going to do is create a service with the default configuration, which gives you a ClusterIP service. The way ClusterIP services work is as follows. If an intake pod wants to reach a pod of the storage application, the intake pod first makes a DNS query, here for "storage", and DNS gives back an answer that is a cluster IP. What's important here is that the cluster IP is actually a virtual IP; it's not the IP of one of the pods. You can see that the pods' IPs are in 10.x while the virtual IP is in 172.x. Because the intake pod got this answer, it now sends traffic to the virtual IP, which has to be translated into the IP of a pod for the traffic to reach the storage pods. That translation is done by a proxier. I say "proxier" because there are multiple ways to implement it: usually iptables or IPVS, and more and more often eBPF. And to configure this proxier, the component that transforms the virtual IP into a pod IP, you have multiple solutions: the classic one is kube-proxy, and more and more people use Cilium, which is what we do.

So ClusterIP services are great, they're magic, because clients always see a single IP. That's extremely convenient: they don't have to care about the number of backends, about backends scaling up and down, or about a backend sometimes being unreachable. Everything is totally masked from the application.
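To make the proxier's role concrete, here is a tiny sketch of the per-connection translation it performs. Everything here is made up for illustration: the service table, the IPs, and the `dnat` helper are not real kube-proxy or Cilium APIs, just a mental model.

```python
import random

# Hypothetical service table, modeled on what a proxier (kube-proxy or
# Cilium) programs on every node: one virtual IP per service, mapped to
# the current pod IPs. All names and IPs here are made up.
SERVICE_TABLE = {
    "172.20.0.15": ["10.0.1.4", "10.0.2.7", "10.0.3.9"],  # "storage" ClusterIP
}

def dnat(virtual_ip: str) -> str:
    """Pick one backend pod for a new connection, similar in spirit to
    the probabilistic DNAT rules that iptables mode installs."""
    return random.choice(SERVICE_TABLE[virtual_ip])

# The client only ever sees the virtual IP; the proxier picks the pod.
backend = dnat("172.20.0.15")
```

The key point is that the choice of backend happens in the network layer, per connection, and the client never learns the pod IPs.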
They just see a single IP. However, it's not perfect, and here is an example of the issues you can get with cluster IPs. In this example we have 10 gRPC servers, and 100 gRPC clients sending a hundred requests per second to these servers. These graphs represent the number of requests received by the servers and the CPU usage of the servers. You would expect this to be balanced, right? You would expect all the backends to get the same number of queries. It turns out it's very different, which is pretty confusing.

The reason this happens is the way gRPC works. If a gRPC client sees a single IP for a service, it considers that there is a single backend, and it establishes a single connection on which it sends all its traffic. So each client is connected to a single backend. The problem is, if you have 10 backends and 10 clients that each pick a backend, it's pretty much like rolling a 10-sided die ten times: some backends will get three clients and some backends will get none. Of course, that's with only 10 clients; with more clients it's slightly better.
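This dice-rolling effect is easy to reproduce. The toy simulation below is purely illustrative (the `assign` helper and the numbers are made up); it just models each client getting pinned to one random backend, which is effectively what a single long-lived connection through a ClusterIP gives you.

```python
import random
from collections import Counter

def assign(clients: int, backends: int, seed: int = 0) -> Counter:
    """Each client independently ends up pinned to one random backend,
    as happens with one long-lived gRPC connection per client behind a
    single virtual IP."""
    rng = random.Random(seed)
    return Counter(rng.randrange(backends) for _ in range(clients))

# With 10 clients on 10 backends, collisions (and empty backends) are
# the norm, not the exception.
dist = assign(clients=10, backends=10)
print(sorted(dist.values(), reverse=True))
```

Running it with different seeds shows the same pattern as the slide: a few backends collect several clients while others get zero.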
This is the distribution of the number of clients connected to each server when you have a hundred clients. It's better, but you can still see a huge discrepancy between backend 4 with six clients and backend 6 with eighteen. So, as a summary, ClusterIP services are magic in many ways, but the problem is that clients have no control over load balancing, and widely used applications rely on that control. Some applications even rely on the fact that they can see all the backends; this is true for Cassandra or Kafka, for instance.

ClusterIP services are the default service in Kubernetes, but one option you have is to turn these services into headless services, which are designed to address this issue. The way they work is very similar, except that when you do the DNS query, instead of getting a virtual IP as the answer, you get the IPs of all the backends. The client gets all the IPs, and then it's responsible for connecting to the different pods. It's much simpler: you can see here that the intake pod sends traffic directly to storage pod 1, with no proxier in between. That's better in our case, but of course it means the client has to load balance, and the client has to handle everything the proxier was handling before: managing the healthiness of the backends, and the staleness of the endpoints when you scale up and down.

Here you can see how this addresses the main challenge we had with cluster IPs: we migrated a service from a ClusterIP to a headless service around 1 p.m., and suddenly the traffic is perfectly balanced. So this solution is much better. However, it's still single-cluster. What happens if we want to send traffic across clusters? There are multiple ways to do this, and we're going to talk about the two main ones in Kubernetes: load balancer services first, and then another one.
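The load balancing a headless service pushes onto the client can be as simple as round-robin over the returned IPs. A minimal sketch, with made-up IPs; real clients also need the health checking and re-resolution mentioned above:

```python
import itertools
from typing import Iterator, List

def round_robin(endpoints: List[str]) -> Iterator[str]:
    """Minimal client-side balancing over the pod IPs a headless service
    returns; real clients also need health checks and re-resolution as
    pods come and go."""
    return itertools.cycle(endpoints)

# Hypothetical answer to a headless-service DNS query.
ips = ["10.0.1.4", "10.0.2.7", "10.0.3.9"]
picker = round_robin(ips)
first_six = [next(picker) for _ in range(6)]  # each backend picked twice
```

gRPC clients get this behavior from their built-in round_robin load-balancing policy once name resolution hands them all the addresses, which is exactly what headless services enable.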
I won't spoil it. Load balancer services allow you to expose services across clusters. The only difference from the setup we had before is that in this case the intake workload runs in one cluster and the storage workload runs in another one, so they can't rely on standard services. When you use load balancer services, your client workload still does a DNS query, but instead of getting a virtual IP in the cluster or the IP of a pod, it gets back the IP of a load balancer, and the traffic is sent to this load balancer, which is usually managed by the cloud provider. However, this load balancer doesn't know how to reach pods; it has to send traffic to nodes. So what happens is that all the nodes in the cluster are registered with the load balancer, and then the proxier we were talking about before is responsible for sending traffic to the actual pods. You can see that the data path is a bit complex: the intake pod, as a client, is connected to the load balancer, which sends traffic to any node, and the proxier is responsible for forwarding traffic to the actual pod. Of course, this is a full mesh, so you have connections all over the place, and it's pretty complex.
It's also pretty inefficient, because you have multiple hops, and you can end up in noisy-neighbor situations: high-throughput traffic destined for pod 1 can land on node 3 first, so you're sending unneeded traffic through a third node.

This can be slightly improved in Kubernetes by using a load balancer service with the externalTrafficPolicy: Local setting. This is designed so that only the nodes that actually run the backend you want to talk to are considered healthy by the load balancer. You can see here that traffic never crosses node boundaries: it only reaches nodes where storage pods are running, and it's load balanced by the proxier there. It's better, but it's still not perfect, because all nodes are still registered with the load balancer, and the only reason some nodes don't get traffic is that they're not considered healthy. That means we have an additional component, which I call the health checker on the slide, that is responsible for probing all the nodes to see if they're running a storage pod.

So load balancer services allow us to get traffic between clusters, but they create a lot of unneeded traffic. There's also additional cost, because you have to pay for the load balancers. It still doesn't work for workloads where the client has to see all the backends, because the client only sees the load balancer. And something I wanted to highlight, which made it a complete no-go at Datadog: there's a hard limit we discovered the hard way, which is that you can only register so many nodes with a load balancer. Every provider has its own limits, which means that if you have more than X nodes in your cluster, you won't be able to use this feature. It's actually one of the only features we completely disable in our Kubernetes clusters, to make sure we won't have incidents.
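To put rough numbers on the extra hop that the default policy causes, here is a toy simulation. The node counts, the `extra_hop_ratio` helper, and the layout are all invented for illustration; the point is only the proportion.

```python
import random
from typing import List

def extra_hop_ratio(nodes: int, pods_on: List[int], trials: int = 10000,
                    local_policy: bool = False, seed: int = 1) -> float:
    """Estimate how often an LB-routed connection lands on a node that
    runs no backend pod and needs a second, cross-node hop. With
    externalTrafficPolicy: Local, only pod-bearing nodes pass the LB
    health check, so that never happens."""
    rng = random.Random(seed)
    targets = pods_on if local_policy else list(range(nodes))
    hits = sum(rng.choice(targets) not in pods_on for _ in range(trials))
    return hits / trials

# 100 nodes, storage pods on 10 of them.
print(extra_hop_ratio(100, list(range(10))))                     # close to 0.9
print(extra_hop_ratio(100, list(range(10)), local_policy=True))  # 0.0
```

With pods on 10 of 100 nodes, roughly 90% of connections take an unnecessary cross-node hop under the default policy, and none do under Local, at the price of the extra health-checking machinery described above.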
So I mentioned that load balancers were one way to get traffic between clusters. Another way is to use ingresses. We're back to the same setup we had before. If you want to use an ingress, the way it works is that in addition to your workloads, you run load balancing proxies; in this example I used NGINX and HAProxy, but you have many other options. What ingresses allow is that when they receive HTTP traffic, they can route it to backend services based on the HTTP host or based on the path. However, you can see that there's a bit of a challenge here: these proxies are very clever within the cluster, but we still need to get traffic to them, to this ingress, and the way it is usually done is with a load balancer service, which, as I said before, isn't great.

There are alternative options, though. If you run on cloud providers, they tend to offer native solutions to this problem. In that case, the load balancer, instead of sending traffic to nodes, sends traffic directly to the pod IPs. You have this option on multiple providers, and the way it works is that a controller runs in your cluster, watches services and pods, and programs the load balancer.

A quick summary on ingresses. First, they're limited to HTTP, which is a bit of a challenge. By default they use load balancer services, which isn't great. And there are native options to route directly to pods, which are much better. However, remember that in the very first part of the presentation I was talking about multi-cloud: the fact that we run on multiple providers means the native implementation is going to be different on each provider. These implementations also have limits, because you need to call the provider's API to reprogram the load balancer every time a pod changes.
If you have new pods, or pods become unhealthy, you need to call the cloud provider and say: program this IP, or remove this IP. This can very easily trigger rate limits, which means it takes time for changes to propagate to the load balancer and to the clients. What we have to acknowledge is that cloud load balancers are great, but they are definitely not designed for high churn, where you reprogram them on a frequent basis.

So the two solutions we've discussed to expose services across clusters rely on proxies and load balancers. What if we could do it without them? There's another solution to expose services across clusters, which we've used extensively, called external-dns. The way this solution works is that you have a controller in your clusters that is also watching endpoints and services, and updating the cloud DNS entries, for instance in Cloud DNS or Route 53. When a client wants to connect to one of the storage pods, it gets back the pod IPs that have been programmed by the controller. As you can see, this is extremely similar to what we had with headless services.

It's very similar, but there's a strong constraint. It seems very simple on the slide, with the client directly connected to the backend, but because the client connects to the IP of the backend, that IP has to be addressable from the other cluster, which is a constraint on how you design your Kubernetes clusters. In our case, we give pods native IPs from the underlying VPCs, which we achieve with Cilium, and this is what allows this type of communication. Once again, this approach has similarities with the challenges we saw with load balancers.
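A minimal sketch of the reconcile step an external-dns-style controller runs on every endpoints change. The zone dictionary stands in for the provider's DNS API, and the record name and IPs are made up; the interesting part is that the controller diffs before calling the rate-limited API.

```python
from typing import Dict, List

# A toy "zone" standing in for a cloud DNS API such as Route 53 or
# Cloud DNS; the record name is made up.
zone: Dict[str, List[str]] = {}

def reconcile(zone: Dict[str, List[str]], name: str, pod_ips: List[str]) -> bool:
    """Sketch of what an external-dns-style controller does on each
    endpoints change: compare the desired pod IPs with the current
    record and only call the (rate-limited) provider API on a diff."""
    desired = sorted(pod_ips)
    if zone.get(name) == desired:
        return False      # record up to date, no API call
    zone[name] = desired  # in reality: one provider API call
    return True

reconcile(zone, "storage.internal.example", ["10.0.2.7", "10.0.1.4"])
reconcile(zone, "storage.internal.example", ["10.0.1.4", "10.0.2.7"])  # no-op
```

Even with this diffing, every genuine pod change still costs an API call, which is where the rate limits described here start to bite.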
Then again, we have to program the DNS of the provider, which means we can hit rate limits if we change things too fast. A good example of this: the nominal propagation latency is actually pretty good, around a minute, but you can very easily get above 10 minutes as soon as you start hitting the rate limits, which is a pretty long time. Once again, the DNS APIs of the providers are not designed for this.

What we wanted to say about this solution in particular: I mentioned the challenges we had with it, but it got us pretty far. We ran it for many years, probably four, before the limits I mentioned started to hit us too hard and we had to change. I'll let Antoine talk about the alternatives we considered before diving into the solution we actually put in place.

Okay, so at this point you're probably wondering: have these people heard about service meshes? It turns out we have, and we evaluated them. Just as a quick reminder, this is the example Laurent was presenting, with intake pods and storage pods, and I'm going to tell you how service meshes work. Typically you have a main container that knows how to do networking; this is the application that developers have written and are running in Kubernetes. The idea of service meshes is that you inject a sidecar container that takes care of the pod's networking needs. The sidecar container connects to a control plane that itself sources data, such as endpoint data like IP addresses, from the Kubernetes API. So what do we expect to gain from this pattern?
First of all, it's independent of cloud providers, so you don't have these problems with rate limits and so on. It's also transparent: in theory, you don't have to touch your application, and it just works out of the box. And then it gives you a bunch of neat things, such as traffic management and load balancing, which would alleviate the problems we saw with gRPC before. The thing is, we had already invested in quite a lot of infrastructure to make sure that traffic, for example for our Kafka workloads, is assigned to the right pods, so this would have interfered with it, and we didn't think we would benefit that much from those features. Also, we're heavy users of gRPC, and gRPC has these features built in.

Another upshot of service meshes is observability. But as it turns out, Datadog already has great observability, both into applications and into the infrastructure, and adding a new component just to take care of observability didn't really seem like the right thing to do. The last big selling point of service meshes is transparent TLS and mTLS, and here again, some of our applications already had that configured and built in, so we would have had TLS twice, which is not really needed. So all of these are very useful features, but for us they just seemed a bit redundant.

Service meshes also come with some downsides. One of them is general debuggability: it's great when it works, but when it doesn't, users tend to be a bit confused, because there's an additional moving part. Another big one is resource management.
At the bottom of this slide you can see the distribution. This is actually taken from our own managed sidecar that we run internally. I'm not going to get into details here, but we do use sidecars for some specific cases, on about five percent of our infrastructure, and here I captured the resource usage of all these sidecars. As you can see at the top, some of them use almost up to two cores, while most of them, the red line at the bottom of this heat map, use close to zero. So how you size that is tricky, unless you make service owners do it, which is one more thing they have to care about. And if you provision for the top, you're going to waste a ton of cores, which you don't want to do.

There's also an inherent cost to having an additional userland network hop. This chart is taken from something that is supposed to advertise Linkerd's performance, and I do think Linkerd's performance is great: the green line is Linkerd and, sorry, Istio is in gray, and it shows that the additional latency from the sidecar is lower with Linkerd. But really, if you make me choose between the baseline, Linkerd, or Istio, I'll do my best to make the baseline work. In many scenarios it won't matter, because you don't have many hops, but we tend to have large fan-outs between services, so a given call serving a user request will hop many times within our infrastructure, through many applications, and each time you pay this latency. We felt this was not really acceptable for us.

Some of the service mesh vendors have acknowledged the problems you may face with sidecars, and they're being addressed in a variety of ways. One way is to just make the proxy as lightweight as possible.
That's what Linkerd does. Cilium uses an L7 proxy on each node, for example, which reduces the overall resource consumption because it's shared across multiple pods, so it's a bit cheaper, but then there are concerns about isolation, and sizing, again, is not that obvious. Istio is now advertising ambient mesh; it's still in a fairly early phase, but it uses an L4 proxy on each node, which again is supposed to be very lightweight, yet still prone to isolation problems and the like. And with Istio in that mode you lose the L7 features, because their ztunnel only takes care of L4 concerns such as mTLS, not request routing and all those fancy features. For those, they have a waypoint proxy, which potentially adds yet another network hop.

How about multi-cluster? What we've described is a service mesh in a single cluster. Usually what they offer is to mesh clusters together: you go into each cluster and tell it about all the other clusters. This works well if you just have a few clusters, but when you have dozens, as we do, it gets really difficult to manage. One thing you may have run into with service meshes, which we did when we evaluated some of them a few years ago, is the high memory requirement of the sidecars, because usually they load the entire mesh. If you have thousands of services, as we do, the data plane just goes crazy on memory, and even on CPU, because it has to constantly handle the churn in services. There are ways to change that, but they require additional configuration.

So, in summary, for service meshes: they provide many features, but many of them can be handled by applications. The sidecars are kind of complex, and they have a cost: financial, cognitive, and in terms of performance. Then there are sidecarless solutions, but some of them are at a very early
stage. Maybe the one to mention is proxyless gRPC, which completely eliminates any proxy. It's an interesting approach, actually, but it's gRPC-only. And for multi-cluster, the solutions are also kind of young and hard to manage.

So at this point you're probably wondering: if they don't use a service mesh, and they don't use the built-in Kubernetes primitives, then what do they use? It turns out we built our own. We had the following requirements when we built our service discovery solution: it had to work across clusters; it had to enable direct pod discovery; it had to merge endpoints from multiple clusters, because we have services that actually span multiple clusters; and it had to require minimal application changes, ideally, because we were a bit under pressure to replace external-dns, which was causing problems. One other thing I didn't mention: it had to be better than external-dns at dealing with things like API rate limits from cloud providers.

Okay, so this brings us to Datadog's DNS system. Back to our two applications. The way it works is that we run controllers in each Kubernetes cluster. When we install a cluster, we just install this additional controller on it. It registers itself in a central backing store, which is hosted in one of the Kubernetes clusters, watches endpoints from its own Kubernetes API, and registers them into the backing store. So this is the layout that we have in this data store.
We just write IP addresses, associated with the namespace and service of those IPs. Then we install DNS servers on the Kubernetes clusters, and they can serve queries based on the service name, across clusters. So what happens when we want to do inter-service communication here is that the intake pod typically queries DNS for the storage service, DNS returns the list of IPs, and then it can just connect directly to the pods based on their IPs.

One thing that is particularly useful for us is the ability I mentioned to merge endpoints from multiple clusters. For example, the storage system might span multiple clusters; we do that for isolation, actually, for some of our systems. We run zonal clusters, and some systems span several of them. So here, as you can see, you can have a query for, say, the metrics storage that has parts in both clusters, and it will return all the IPs. We also attach additional metadata to each endpoint. Here I gave the example of the availability zone, which lets users query a service in a given AZ.
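The cross-cluster merge and the AZ filter just described can be sketched in a few lines. The cluster names, IPs, and the `resolve` helper are all hypothetical, invented for illustration; the real system serves this through DNS.

```python
from typing import Dict, List, NamedTuple, Optional

class Endpoint(NamedTuple):
    ip: str
    az: str  # availability-zone metadata attached to each endpoint

# Hypothetical registrations for one service, as written to the central
# store by each cluster's controller; cluster names and IPs are made up.
by_cluster: Dict[str, List[Endpoint]] = {
    "metrics-a": [Endpoint("10.0.1.4", "us-east-1a")],
    "metrics-b": [Endpoint("10.1.1.9", "us-east-1b"),
                  Endpoint("10.1.2.3", "us-east-1b")],
}

def resolve(store: Dict[str, List[Endpoint]], az: Optional[str] = None) -> List[str]:
    """Merge endpoints from every cluster; optionally keep a single AZ
    to answer zone-local queries."""
    merged = [ep for eps in store.values() for ep in eps]
    if az is not None:
        merged = [ep for ep in merged if ep.az == az]
    return [ep.ip for ep in merged]

print(resolve(by_cluster))                   # all IPs from both clusters
print(resolve(by_cluster, az="us-east-1b"))  # zone-local answer only
```

A plain query sees the whole service regardless of which cluster each pod lives in, while an AZ-scoped query returns only zone-local endpoints.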
We use that to keep traffic zonal and that kind of thing, which saves on cost and also improves reliability in certain cases.

One thing I wanted to address is that etcd sounds like kind of a dangerous thing to run here. The way we address that is that we actually run multiple instances of this stack, one per AZ, and this gives us enough redundancy that if we have a problem with one of them, for example because we want to do an upgrade, we can use the others. We also have multiple layers of caching in order to serve the high volume of DNS queries that we have. I think the most interesting one is that we have a node-local cache for all queries. This is very useful because not all applications are perfect, and some of them can go a bit crazy on DNS queries; at that layer we typically manage to isolate the nodes from each other, and this really limits the blast radius in case one application tries to DoS the DNS servers.

So, our custom discovery system: it's been in production for a few years now, and it's been working really well. We've had no outage whatsoever, it's alleviated all the problems we had with external-dns, and it's also enabled new use cases.

If you've been following the CNCF ecosystem in the last few years, you're probably wondering: can you do better than DNS? DNS is very primitive. Basically, you give it a name, and in the normal case of an A record it gives you back a bunch of IP addresses. Sure, you can do all sorts of crazy things with DNS, but clients usually don't support them, so you're not really gaining anything from that. Some data planes have another service discovery protocol they're trying to standardize, called xDS, which has a much richer data model. Beyond the DNS interface we saw, it can also group IPs by locality and priority.
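The locality and priority grouping just mentioned can be sketched like this. The group structure is only loosely in the spirit of xDS locality endpoints, and the localities, IPs, and the `pick_endpoints` helper are all made up.

```python
from typing import Dict, List

# Hypothetical endpoint groups in the spirit of xDS locality endpoints:
# per-locality lists with a priority (0 = preferred, higher = failover).
groups: List[Dict] = [
    {"locality": "us-east-1a", "priority": 0, "ips": []},            # local AZ is empty
    {"locality": "us-east-1b", "priority": 1, "ips": ["10.1.1.9"]},  # failover AZ
]

def pick_endpoints(groups: List[Dict]) -> List[str]:
    """Return the endpoints of the lowest priority level that has any,
    which is the basic priority-failover behavior xDS describes."""
    for priority in sorted({g["priority"] for g in groups}):
        ips = [ip for g in groups if g["priority"] == priority for ip in g["ips"]]
        if ips:
            return ips
    return []

print(pick_endpoints(groups))  # fails over to us-east-1b: ['10.1.1.9']
```

This is the kind of decision a plain A record can't express: clients prefer their own locality and fall back to the next priority only when it's empty.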
You can do failover between them; you can do all sorts of interesting things. This is actually what service meshes use, or not all of them, but at least Istio and the Envoy-based ones, in order to do things like traffic splitting, for example. You can also configure security policies and many other things through this API. But this is only available for Envoy and gRPC, so it's actually quite limited. Also, it doesn't address scalability at all: xDS is push-based, compared to DNS, which is pull-based, but honestly you're not going to get a lot of improvement from that; you'll just save a bit of bandwidth on the control plane.

If you remember the DNS design: we do use xDS, actually, with exactly the same layout; we use the same data. We have a custom xDS control plane, and we use it for a couple of cases: to configure the Envoy sidecars I talked about before, and to configure our edge load balancers and our internal load balancers, which we still have in some cases where they're useful. We are also rolling out using it directly from gRPC clients for more advanced use cases. And we offer a custom API for our internal users; it kind of resembles the Gateway API, really, from which we took a lot of inspiration.

Okay, so the solution, as presented, sounds simple, it sounds great, but of course, as you can imagine, there are trade-offs, and the question is: what's the catch?
So the key thing is that it's actually not that simple, because sending all the backend IPs to the clients means that the logic has to be implemented and tuned in the clients, and the defaults will usually not do what you want. For instance, if you give five IPs to some clients, they will only connect to the first one, which is not what you want. This implies a lot of things. We mentioned before that service meshes provide a lot of features, and that cluster IPs hide a lot of the complexity. If you do it yourself, as we do, the client has to do it all: you have to support and provide observability, you have to detect stale endpoints, you have to do load balancing in a way that is efficient, and of course health check the backends, because sometimes they will be discovered but not reachable. And if you want authentication and TLS, you have to do that too. In addition, you have to do this for every protocol you want to support: in our case we use gRPC pretty heavily, but we also have some HTTP, and if you have other services you need to support them too, Kafka for instance. And finally, you have to do this in every single language you run. In our case it's mostly Go, Java, and Python, but we're seeing more and more Rust, and this means we have to implement this for all the clients in all these languages.

So this works, and we currently run it, but it works because we have Antoine's team working on it, owning the libraries for all these languages, and making sure they do the right thing in an efficient way. And this brings us to our conclusion, the few things we want you to remember from this talk. The first is that Kubernetes-native service discovery primitives work very well as long as you remain within a cluster.
With cluster IPs and headless services, you should be able to address most of your use cases. However, as soon as you have multiple clusters, things get much more challenging. I mentioned load balancer services and ingresses; they help, but they feel a bit hacky, especially if you have many clusters and high-throughput applications. And of course, Kubernetes primitives don't integrate with non-Kubernetes services: if you want to expose services that don't run in Kubernetes, it's very difficult with Kubernetes primitives. Service meshes are promising, because they aim to hide a lot of this complexity, but the trade-off is that they come with their own complexity in terms of deployment, debuggability, and cost. For some of our applications, it would be totally impossible to run with a sidecar, just because it could almost double the cost of running the infrastructure.

So in the end, we built our own service discovery, independent of Kubernetes, which is both good and bad: bad because we have to do it ourselves, and good because it means we're not limited by Kubernetes boundaries. Of course, we still rely heavily on Kubernetes for information about endpoints and services. And finally, as I was hinting at just before, this works because we heavily rely on our clients being smart: we're removing anything smart from the network and moving everything into the clients.

And this is it. Thank you very much. If you're interested in this topic, we're always hiring, and if you have questions for us, it's probably going to be a bit tight right now; maybe we can take one question, but you can also reach out to us by email or on Slack, and we'll be around for the rest of the day. Thank you.