So hey, Muren and I are here from Datadog. We're going to talk about our journey connecting millions of containers with gRPC. Just a few words about our company, for those who don't know us: we build multi-cloud monitoring and security products, and we offer metrics, logs, and traces to our users. This talk is not going to be about how to monitor gRPC using Datadog; it's going to be about how we use gRPC to power the Datadog product. We're actually a fairly large service: we have tens of thousands of customers, and we store trillions of data points per day. That gives you an idea of the kind of challenges we face on the back end, and it puts the rest of the talk in perspective.

So why do we use gRPC? We started using it fairly early on, around 2016, and the main reason was client and server code generation based on Protobuf. It makes it really easy to write applications that communicate over the network, and it supported the languages we needed. The fact that it embeds advanced client-side load balancing was more of a convenient discovery than a deliberate design choice, but as the Datadog back end became larger and larger, those features proved to be truly useful to us.

OK, so this is a typical setup of a gRPC service, with clients at the top and servers at the bottom. Most services run on many replicas, and some can have a lot of them, like hundreds. We use gRPC's built-in client-side load balancing to spread the load across the server instances; we usually do not rely on an external load balancer between services to do that work. This helps us save on costs and operations, so service owners typically like this approach. We use DNS for service discovery, and we have no advanced control plane features, although that might change in the future.

While this setup has generally been working really well for us, it has also come with some challenges, and using gRPC properly in this setup has been an ongoing learning journey. At first, users at Datadog were mostly on their own when using gRPC, and soon enough they bumped into a set of common problems. The way we helped our developers deal with those problems is by providing gRPC wrappers, and here on the slide you can see examples of some of the issues we try to handle inside those wrappers. We are going to discuss some of them in more detail later in the presentation.

When working on our gRPC wrappers, we try to be opinionated: we provide a set of reasonable defaults that works for everybody, and if we need to expose some features, we usually expose them as a set of boolean flags. We try to keep the surface minimal and make it easy for our users to consume. This approach has a nice property: we can transparently deal with some of the problems, even without user involvement, by adding new default behaviors to our wrappers.
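To make the wrapper idea concrete, here is a minimal sketch of what such an opinionated dial helper can look like in Go. The package, the flag names, and the default values here are illustrative assumptions, not our actual internal API:

```go
package grpcwrap

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// Options exposes a small set of boolean switches; everything else is an
// opinionated default. Hypothetical API, for illustration only.
type Options struct {
	EnableWeightedRoundRobin bool // opt in to load-report-based balancing
}

// Dial applies our defaults and returns a ready-to-use client connection.
func Dial(target string, opts Options) (*grpc.ClientConn, error) {
	policy := `{"loadBalancingConfig": [{"round_robin": {}}]}`
	if opts.EnableWeightedRoundRobin {
		// Policy name per gRFC A58; availability depends on the gRPC version.
		policy = `{"loadBalancingConfig": [{"weighted_round_robin": {}}]}`
	}
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // placeholder creds
		grpc.WithDefaultServiceConfig(policy),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    5 * time.Minute,  // illustrative ping interval
			Timeout: 20 * time.Second, // illustrative ping timeout
		}),
	)
}
```

Because everything funnels through one helper like this, a new default added to the wrapper reaches every service on its next deploy.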
OK, so the first thing we take care of in our wrapper libraries is failure detection. One scenario that is particularly common in a setup like Datadog's is silent connection drops, which typically happen when a host is abruptly cut off from the network. So how does gRPC actually handle this scenario? By default, it doesn't do anything special: it relies on the Linux kernel to detect that the link has failed. And in a typical setup, if you install any distribution of Linux and use the defaults, it's going to take a whole 15 minutes for the kernel to detect that the link is dead. That's how long you're going to keep getting errors from a gRPC server with a failed link.

So can we do better than that? Well, gRPC actually has a feature that you need to turn on explicitly to make this much better: it's called keepalive. We set it in our wrapper libraries, and we set it by default, because we believe that in our setup everybody should have keepalive enabled. That's an example of one of the things we configure, and it has basically completely eliminated these error patterns, which used to be really common.

As we felt we had a good grasp on the issue of failing nodes, we were able to bring failure detection to the next level. As you've seen, setting keepalive allows us to efficiently deal with network problems. But what if the network is fine, but an instance of an application server is failing for some other reason? For example, it cannot connect to its database, it got misconfigured, or some other issue is happening. gRPC will happily keep sending requests to this failing server, which will increase the error rate for all clients connected to it. Here on the screen, you can see a diagram from a synthetic test that we run in our environment where one server is constantly misbehaving. The way we deal with this problem is by enabling a newly added gRPC feature called outlier detection. Outlier detection works like this: it allows ejecting a misbehaving server by comparing its error rate against the mean, and it has fairly complex configuration parameters for tuning the ejection procedure. Through some experiments, we came up with a set of reasonable defaults that works for most of our users, and we configure them by default in our wrapper libraries. As you can see on the bottom diagram, by enabling outlier detection, the impact of a misbehaving server can be completely eliminated.
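For reference, here is a hedged sketch of what these two client-side defaults can look like. The keepalive API is standard grpc-go; the outlier detection policy name and JSON fields follow gRFC A50 as implemented in grpc-go, to the best of my knowledge, and all the numbers are illustrative rather than our production values:

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Outlier detection configured via the service config, wrapping round robin.
// Field names per gRFC A50; the values below are purely illustrative.
const lbConfig = `{
  "loadBalancingConfig": [{
    "outlier_detection_experimental": {
      "interval": "10s",
      "baseEjectionTime": "30s",
      "maxEjectionPercent": 10,
      "successRateEjection": {
        "stdevFactor": 1900,
        "minimumHosts": 5,
        "requestVolume": 100
      },
      "childPolicy": [{"round_robin": {}}]
    }
  }]
}`

func failureDetectionOptions() []grpc.DialOption {
	return []grpc.DialOption{
		// Keepalive pings let us detect dead links in seconds, not minutes.
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    5 * time.Minute,  // ping interval (can stay high)
			Timeout: 20 * time.Second, // also applied as TCP_USER_TIMEOUT
		}),
		grpc.WithDefaultServiceConfig(lbConfig),
	}
}
```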
Now that we've covered failure detection, let's discuss some other common problems that our gRPC users faced in our environment. Here on the screen, you can see a diagram of CPU utilization for some of our gRPC servers. At first, it might look surprising that some servers get a lot more load than others. But the reason this happens by default is pretty simple: gRPC uses pick-first as its default load balancer. Pick-first works like this: every client picks a single server out of the list of all available servers and then sends all its requests to that server. Statistically, the probability that the load will be evenly distributed across servers is very low. So what we did to deal with this problem is change the default in our environment: we now use round robin as the default load balancer. Round robin works very differently: every client rotates its requests across all available servers, so every server receives almost exactly the same number of requests. You can see the impact of this change on the screen: requests are now perfectly balanced. But in practice, we want balanced CPU utilization, not just a balanced number of requests, so this is still not enough.

OK. One thing to keep in mind here is that service owners typically want to tune their autoscalers, which adjust the number of pods. The vast majority of service owners use CPU-based autoscaling: they add instances when the CPU of their servers goes over a threshold for a sustained period of time. The question is, when you have many servers, which number do you take? One obvious answer could be to take the average. But that means that if some servers have high CPU usage and others have low CPU usage, some of them will still get overloaded before the autoscaler kicks in. So people typically have to provision the autoscaler for the worst-performing instances.

Here is a graph of usage for one of our production services. As you can see on the top, because we use round robin, the requests are perfectly balanced. But on the bottom graph, CPU usage shows this banding effect, where some servers sit around 50% CPU usage and others around 60%. This is a problem in practice, because you have to tune your autoscaler based on the servers with the higher CPU usage. So what's going on under the hood here? It turns out that round robin is good at balancing requests, but not at balancing load: here we have our workloads running on two different generations of CPU. Ideally, you would want CPU usage to be perfectly balanced, not the number of requests. And one thing to keep in mind is that the team responsible for scheduling workloads onto servers isn't really able to ensure that all the servers for a given deployment run on the same kind of instances.

So can we do better than that? Well, those of you who have followed gRPC development recently will recognize the perfect tool for this: the weighted round robin balancer, a feature that was recently added to gRPC. The idea is that servers communicate their current load in each response, and clients then weight the number of requests sent to each server according to a computed capacity. Integrating that into our environment was actually really easy: we just had to implement a few interceptors and a timer loop that measures CPU usage, and plug that into the built-in weighted round robin load balancer. We do that through our wrapper libraries again, so service owners can enable it through a simple boolean flag. The results have been really impressive. As you can see on the top graph, after deploying the load reports, CPU usage becomes completely balanced; it becomes a straight line. And it's now the number of requests per pod that differs according to server performance. The really nice thing about this feature is that it requires no coordination and no control plane; it's really easy to set up.
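As a rough sketch of the server side of this, grpc-go exposes load reporting through its orca package; something along these lines, where readCPUUtilization stands in for whatever CPU measurement you actually implement:

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/orca"
)

// readCPUUtilization is a stand-in for real CPU measurement (an assumption:
// you might read /proc or cgroup stats here). Returns a value in [0, 1].
func readCPUUtilization() float64 {
	return 0.5 // placeholder
}

func newLoadReportingServer() *grpc.Server {
	// Recorder holding the server's current utilization numbers.
	rec := orca.NewServerMetricsRecorder()

	// Attach per-call load reports to responses so weighted round robin
	// clients can pick them up.
	srv := grpc.NewServer(orca.CallMetricsServerOption(rec))

	// Timer loop: periodically refresh the reported CPU utilization.
	go func() {
		for range time.Tick(10 * time.Second) {
			rec.SetCPUUtilization(readCPUUtilization())
		}
	}()
	return srv
}
```

On the client side, it is then a matter of selecting the weighted round robin policy in the service config, as in the wrapper sketch earlier.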
OK, so as we just showed, round robin is a great way to achieve good load balancing, but it also has some drawbacks. Let's look into that. Each arrow on this diagram shows a connection between a client and a server. With round robin, in this client-side load balancing setup, the total number of connections is the number of clients multiplied by the number of servers, because each server receives one connection per client. This number can become quite large if you have a lot of clients. So the question is, is that a problem in practice?

It turns out that it is, because even though each of those connections is cheap, when we have thousands of them, the resources they consume add up. Here on the screen, you can see a screenshot of the Datadog profiler applied to one of our Go gRPC servers, and as you can see, a lot of memory, almost 2 gigabytes, is spent on some internal gRPC buffers. So we dug deeper to investigate what's going on there.

To understand the problem, let's consider a typical gRPC setup for a Go server. Here you can see a connection: every connection in gRPC Go allocates two buffers, one for reads and one for writes, and gRPC Go uses those buffers to proxy requests onto the underlying network socket. Their sizes are 32 and 64 KB. Those buffers actually help performance: accessing the network socket is expensive because it requires system calls, while accessing data in memory is much faster. But consider what happens when we use round robin and end up with thousands of connections: just those two buffers can account for gigabytes of memory on every server, which adds up to terabytes of memory across our whole environment. One more problem is that in the round robin case, the benefit of those buffers is limited, because a lot of the connections are not heavily utilized: we have thousands of connections, but many of them are mostly idle, and the buffers just sit there without helping us much.

So what we did is work with the gRPC team to introduce an optional mechanism that allows sharing buffers between connections. We use sync.Pool, which is a Go abstraction for sharing objects between concurrent goroutines, and we wrote some logic to release the buffers and make them available to other connections when necessary. After we tested this feature on one of our gRPC servers, the results were really good: we saw a 40% memory decrease with no visible CPU impact. If you want to learn more about this work, here is a link to the pull request, which also includes some benchmark results.
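To give a feel for the pattern, here is a simplified illustration of the idea; the real change lives in the pull request linked on the slide:

```go
import "sync"

// Pool of 32 KB I/O buffers shared across connections. Idle connections no
// longer pin a buffer each; they borrow one only while actually doing I/O.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 32*1024)
		return &b // pointer avoids an allocation when the slice is put back
	},
}

// withSharedBuffer borrows a buffer for the duration of a single I/O
// operation, then releases it for other connections to reuse.
func withSharedBuffer(op func(buf []byte) error) error {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp)
	return op(*bp)
}
```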
But I must mention that Go is not the only language where we had problems with the sheer number of connections generated by round robin. This is the same diagram again, just showing the connections for one client, and here we are talking about Python. One thing we quickly discovered when we investigated this problem of too many connections causing too much resource usage is that the reality doesn't look like this; it actually looks like this. Each Python client opens not just one connection to each server, but a bunch of them. Why is that? Well, a lot of popular Python frameworks spawn not just one process to handle requests, but a bunch of them, and those processes cannot share a gRPC runtime, so they cannot share connections. In our environment, the servers that serve our large Python monolithic application typically run thousands of processes, which means that in this case we end up with dozens of connections per client to each server. The way we improved the situation here is simply by running an external connection pooler: we use an Envoy proxy for that, running as a sidecar that the Python processes connect to, and it is then responsible for doing the round robin for us.

By doing that, we were able to bring Python on par with other languages, but we didn't really solve the root problem; we just greatly reduced the number of connections in the Python case. So can we do better than that? The thing that we think should address the root cause of the too-many-connections problem is subsetting. The idea behind subsetting is really simple: every client chooses a subset out of all available servers and then round-robins requests between them. A useful way to think about subsetting is that it sits in the middle between two extremes, pick-first and round robin. Compared to pick-first, it still opens more connections, because every client connects to a subset rather than a single server, but still way fewer than with round robin. And because we open more connections, we get better load distribution across servers than with pick-first. The only kind of subsetting we have implemented in our infrastructure so far is random subsetting, and the algorithm is trivial: every client just picks a random set of hosts and then round-robins requests between them.
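A minimal sketch of that algorithm, assuming the client receives the resolved backend list and a target subset size (this is just the idea, not the API from the gRPC proposal):

```go
import "math/rand"

// randomSubset picks k backends uniformly at random; the client then
// round-robins only across this subset instead of the full server list.
func randomSubset(backends []string, k int) []string {
	if k >= len(backends) {
		return backends
	}
	shuffled := append([]string(nil), backends...) // don't mutate the input
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled[:k]
}
```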
Here are the results when we applied it to one of our services; this particular service has thousands of clients. As you can see, the number of connections as well as memory utilization on the server went down a lot. But the CPU impact was not that great, and that's, once again, because random subsetting has exactly the same problem as pure pick-first: the imbalance between the most and the least utilized server grows, and service owners have to account for that when allocating resources for their servers. It might still be a fair trade-off for some types of applications, but it doesn't feel like a generic solution we can enable by default for everybody. So what we are doing right now is working closely with the gRPC maintainers to try to introduce a standard way of doing subsetting in gRPC, exploring smarter algorithms for choosing subsets on the client, as well as algorithms for distributing load once a subset is chosen. This is still work in progress; you can see a link to the corresponding gRPC proposal on the screen, and it's too early to share any results of this work yet. That's it. We'll be happy to answer any questions. Yeah, go ahead.

Q: Thank you, Mike. I'm wondering, are you expecting that with random subsetting being taken out and deterministic subsetting coming in, the CPU load will be better balanced across servers? Is that the expected outcome?

A: Yes. With deterministic subsetting, there are some well-established algorithms; we looked at the ones used at Google and at Twitter's Aperture. It is actually possible to achieve perfect request and connection distribution if you use deterministic subsets. The problem with those algorithms is that they require coordination between clients, and that's the point where we got some pushback. That's why we're exploring different options for doing this, and we may end up not contributing deterministic subsetting to gRPC. If you know of approaches that work without coordination, please come and talk to us.

Q: Just to follow up, and it may be a naive question, but is sharing buffers between connections, like you did for Go, not a possible solution for the Python problem?

A: I don't think Python is affected by the same problem; it's specific to how the Go implementation works. I don't want to dig too much into technical details, but I don't think it applies to Python.

Q: We use subsetting at our company, and we leave it to the teams to decide what the subset size should be. I'm curious whether you have any data on how to decide the subset size based on the number of clients and servers.

A: We have a few teams, as we said, that do that, and they choose the subset size themselves. For us, it's a little cumbersome, because we don't have a centralized way to control all clients. As I mentioned at the beginning, we don't have a control plane that can dynamically inform all clients which subset size they should use.

Q: So is this some kind of configuration that the team needs to change and then roll out to all the clients?

A: Yeah, it is. You need to pass this information at client creation. We are also looking at options where we could calculate the optimal subset size based on some metrics, but I'm not sure yet whether that's feasible.

Q: In the case of weighted round robin, you said the load is reported back to the client, either out of band or along with the RPC. In your case, which was it? Out of band?

A: We use the default, which is reporting within the responses. I think both would have worked equally well.

Q: OK, because if it were reported out of band, you could also use that to figure out which subset to choose, or where to send RPCs and all that. Even pick-first could be used with that, I was thinking.

A: Yeah, we're exploring those options, like trying to disconnect from the most connected servers. But as I said, we don't have results yet.

Q: I have a question for you guys. Going back to setting the keepalive parameters: are those turned on by default in your services? And if so, how do you determine the default values?

A: It's actually pretty interesting. I didn't get into detail here, but we set the interval to a pretty high value, because of the main thing, in my opinion, that gRPC's keepalives are doing. The mechanism is that it sends pings regularly, and you can configure the interval and the timeout. But regardless of the interval you set, it applies TCP_USER_TIMEOUT on the socket, and this is what basically reduces that detection window from 15 minutes to something like 15 seconds. In most cases, it's not the pings themselves that matter. Yeah, and the actual ping interval, I think, is still 15 minutes, but we don't care about the keepalive pings themselves. And just to mention, historically, to get back to the beginning of the presentation, people had been tuning these keepalive parameters to crazy values, setting them in ways that actually caused more problems than they solved. That's one of the reasons for centralizing this in the wrappers.

Q: After the initial subset is computed, and outlier detection then ejects a host, what happens to the subset?

A: In our setup, nothing; it just reduces the size of the subset.

Q: So with outlier detection, is there no risk of removing all the servers in the subset, since you're only looking at the subset and ejecting the outliers? All servers can't be outliers.

A: Right, all servers cannot be outliers. And there is a protection you can configure, the maximum percentage of ejected hosts; I think it defaults to 10%, so you will never eject more than 10% of your servers.