Thank you for attending our talk today: "Service Mesh, a Hole in the Pocket." A little bit about your presenters. My name is John Murray. I'm an engineer on a service networking team at Stripe, building our internal service mesh. I'm an occasional Envoy contributor when the need arises, and outside of work I'm a C++ enthusiast. And I'm Venil Narona. I work on the service mesh platform at Stripe. I've been a contributor in the Istio and Envoy communities, and I really enjoy distributed systems.

Okay, so let's jump into the promises of service mesh. First is simple application networking: features like load balancing, circuit breaking, and retries. The service mesh provides all of these features in a way that's really easy for application owners to take advantage of. It may be as simple as specifying a few headers on the request (we'll show a small sketch of this in a moment), and it works across different languages and frameworks.

Next is traffic patterns: things like canary deployments, traffic splitting, traffic shaping, and fault injection. These are similar to the application networking features before, but they're more organization-wide. Whether it's for reliability or for advanced deployment scenarios, these are features you're going to want as your company scales.

Then security, which for most people is going to look something like mutual TLS (mTLS) and role-based access controls. And lastly, observability: metrics, access logs, and distributed tracing.

As your organization grows, you're going to want a lot of these features and controls, and you're going to want uniform interfaces for all of them. The service mesh solves this, especially given that as a company grows we typically see more languages and more frameworks in use, whether through evolving development practices internally or through something like an acquisition. Having a service mesh unifies all of these interfaces into one access point, both for control and for development.

Okay, so what are the costs of running a service mesh? At a high level, there are explicit costs: the costs we typically think about upfront, when we're doing research and deciding whether to adopt a service mesh. There are hidden costs: the costs you're typically not going to think about before adopting, and that you usually find after you've adopted, implemented, and run the service mesh for a while. There are integration costs: what does it cost to hook my service mesh up to other third-party services? We'll see what those are in a little bit. There are developer costs: it's not free for developers to build on top of a service mesh, so what does it take to provide the level of education and productivity you want to deliver, so your developers can be effective? And lastly, support costs: the service mesh isn't going to run itself, so what is the cost to maintain and operate it?
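Here is that sketch of how simple the application-facing interface can look. This is a rough illustration of a route-level retry policy in Envoy's v3 configuration, not the configuration we run at Stripe; the cluster and domain names are made up:

```yaml
# Fragment of an Envoy (v3 API) route configuration; everything around it
# is elided. The "payments" names are hypothetical.
route_config:
  name: local_route
  virtual_hosts:
    - name: payments
      domains: ["payments.internal"]
      routes:
        - match:
            prefix: "/"
          route:
            cluster: payments_cluster
            retry_policy:
              retry_on: "5xx,reset"     # retry on 5xx responses and connection resets
              num_retries: 2
              per_try_timeout: 0.250s   # bound each attempt so retries don't stack up
```

Envoy can also honor per-request headers such as x-envoy-retry-on and x-envoy-max-retries, which is the "few headers on the request" experience we mentioned.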
Okay, let's jump into the explicit costs. First is CPU. I've put "misconfigured" in quotes here because it's sometimes too easy to provide too much CPU, so you're just burning money, or too little, so you're adding latency overhead. The same goes for memory: allocate too much and you're burning money; allocate too little and you may be impacting the performance of your applications. Then there's latency and misconfigured concurrency: the different ways you configure your service mesh, and how many resources it takes up, also impact that latency. These are the explicit costs, the things we do think about: how much money is it going to cost to run this new code on all of my machines, and how much latency am I going to add to all of the requests flowing through the service mesh?

Okay, so now let's look at the hidden costs. First up, network bandwidth usage, particularly control plane traffic. The configuration data you ship from your control plane to your Envoys may be fairly large depending on your fleet size: on the order of megabytes, maybe gigabytes if you're operating a significantly large fleet. And depending on your topology, you may be shipping this data to thousands, tens of thousands, maybe hundreds of thousands of nodes. That's all network bandwidth usage you now have to add to your costs.

Next is I/O costs. The service mesh makes implicit a lot of things that were previously more explicit. An example here is availability zones: whereas before your code may have accounted for crossing availability zone boundaries, with a service mesh you have to be careful in how you configure it, because you may be crossing those boundaries without knowing it and paying the additional costs. There are also feature-specific costs. Health checks, especially active health checking, send more data across the network to check the health of all the upstreams; this can turn into a star pattern with service meshes, so you have to be careful about how much traffic you're generating just for health checks. And other features like hedging and retries are reliability features we probably want, but we need to understand that they're not free: we are going to generate more network costs by using them.

Okay, so next up is integration costs. We can think first about metrics. The service mesh generates a ton of metrics, and they're uniform across many different services, so they're highly useful for debugging and understanding your system. However, it can generate a rather insane number of metrics, probably more than you need, and you have to think about the costs of storing them, and the vendor costs. If you're in the cloud, that's a vendor cost; even if you're running your own internal metrics system, like a Prometheus store, you have to think about all the Prometheus clusters you'll need to run to support the metrics coming from your service mesh.

Log storage is the same way. The service mesh can produce uniform access logging, which can be super useful for debugging and troubleshooting heterogeneous systems. But again, it can get very costly very quickly: you can be generating terabytes or more of data every day, or even every hour, so it is definitely something to consider. And in the same vein as log storage is distributed tracing: you're going to be generating a lot of spans.
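One common lever for span volume is sampling. As a minimal sketch (illustrative values, with the tracer provider configuration elided), Envoy's HTTP connection manager can trace a percentage of requests rather than all of them:

```yaml
# Fragment of an Envoy (v3 API) HttpConnectionManager; a tracing provider
# such as Zipkin or OpenTelemetry must also be configured, elided here.
tracing:
  random_sampling:
    value: 1.0    # trace ~1% of requests instead of the 100% default
  overall_sampling:
    value: 1.0    # overall cap on the fraction of traced requests
```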
So you need to think upfront, if tracing is a feature you want to use, about how much you're going to allocate for storage costs.

Okay, so developer costs. This mainly boils down to education. You're going to need to train your developers on how to debug and troubleshoot errors that may arise from the service mesh, or from use of the service mesh, and teach them code patterns to use and to avoid. This is really about understanding the interfaces of the service mesh and how to work with it. The mesh is not always as transparent as it may be marketed to be, or as we may hope it to be, so education is important. And lastly, managing resource quotas. Especially if you're running in a Kubernetes-type system, with, say, an Envoy sidecar, your developers are going to need to know how to properly set quotas for the service mesh and what the trade-off is there between cost and performance (there's a small sketch of what that looks like at the end of this half of the talk).

All right, so support costs. First up is maintenance. This may be something like CVE patches; those are things you have to do to stay secure, so you need to spend the time to patch your service mesh. You may have other patches as well, just bug-fix patches for issues you find. A service mesh typically carries a lot of different traffic patterns, and not all of those patterns have equal coverage from the upstream provider of that open-source software, so you may find yourself fixing things yourself sometimes. Then there's API compatibility and upgrades. Looking at the Envoy ecosystem as an example, the xDS migration from v2 to v3 involved a lot of work, and potentially down the road a v3-to-v4 migration in that configuration language will also involve a fair bit of work, which scales as your fleet size grows.

Operations: data plane upgrades. If you're deploying a mesh in a sidecar pattern, this may just be a lot of nodes to update, a lot of time spent doing those upgrades. Control plane upgrades may fall into the same boat. You may be using a central control plane, in which case your upgrades are faster, but they're also much riskier, more dangerous, so there's additional time taken to make sure you get them right. And things like cert rotations: all of these things just take time.

Troubleshooting is the next big one. We talked about developer education and how important it is for users to understand how to troubleshoot errors and work with a service mesh, but service meshes can just be a lot: they're built very generally to serve a lot of use cases. So when a critical service is having issues, it can be really hard to tell the owners, "go read this giant body of documentation and you'll be able to figure out your own problem." Sometimes you just have to jump in, get things fixed, get things up and running. So your team operating the service mesh will probably spend a fair amount of its time troubleshooting.

And lastly, ownership of client libraries. Again, under developer education we talked about patterns and practices for working with a service mesh. You may decide it's easier to take those patterns and practices and implement them inside a client library that you distribute to your users. But this is just another form of cost for that particular issue, and one that you should consider. So that wraps us up for costs.
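Here is that sketch of the resource-quota trade-off on a sidecar. It is a hypothetical Kubernetes pod fragment; the names, image versions, and numbers are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-service        # hypothetical service
spec:
  containers:
    - name: app
      image: payments-app:latest
    - name: envoy-sidecar
      image: envoyproxy/envoy:v1.28.0
      resources:
        requests:
          cpu: "100m"           # too little and proxy latency suffers under load
          memory: "128Mi"
        limits:
          cpu: "500m"           # too much, multiplied across the fleet, burns money
          memory: "256Mi"
```

Whatever numbers you pick get multiplied by every pod in the fleet, which is exactly the CPU and memory line items from the explicit costs.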
I'm going to hand it off to Venil now to talk about strategies for controlling cost.

Thank you, John. Let's now have a look at some strategies for controlling costs. I will first give you an overview of the different strategies, and then I will dive into each one. The first strategy is to understand the defaults. Service meshes and proxies come with default configurations, which can affect the proxy's request-routing behavior or the amount of data the service mesh generates. Therefore, it becomes important to understand these defaults in order to control costs. The next strategy is to sample access logs. Depending on the amount of traffic flowing through a system, your service mesh can generate a large number of access logs. This can quickly increase your storage costs, and a good sampling strategy for your access logs can help you cut them down. Similar to that is metrics. Because service mesh proxies can generate a large number of stats, they can quickly overwhelm your metrics systems, so it becomes important to simplify metrics in order to reduce costs. The final strategy here is to perform AZ-aware routing, where AZ stands for availability zone. Cloud vendors typically have pricing policies for networking and I/O, and when your requests cross AZ borders, it's going to cost you. It then becomes important to identify which services are routing across AZs, and to find ways to minimize cross-AZ traffic.

Let's now have a look at some details. Envoy, for example, has a default HTTP/2 configuration: you can configure max concurrent streams, or you can leave it at the default. You can also configure Envoy's concurrency, which is basically the number of worker threads Envoy uses for request processing. Envoy also comes with default retry and circuit breaker configurations, which affect the way requests are routed through a system. And by default, Envoy generates a lot of stats, so it becomes important to figure out which of them are meaningful and which ones to discard. If you misconfigure Envoy's concurrency or HTTP/2 configuration, for example, it can degrade performance, thereby increasing your costs. If you leave circuit breaker or retry configurations at their defaults, it can affect the way your requests are routed through your system in a failure scenario, and that can increase your network usage, thereby increasing costs. And as I said, metrics by default can overwhelm your metrics system. So understanding the default behavior of your service mesh and proxies is important to cut down the associated costs.

You typically see a lot of 2xx responses in your system. These are not very interesting, so logging them can definitely overwhelm your system without providing good insights. It therefore becomes important to identify your use case for access logs. In some cases you may want to use access logs for incident remediation; in other cases you may want to use them for long-term learning, for example to visualize a service graph of all the components in your system. You can also think of sampling or filtering access logs depending on your need. In the case of Stripe, we completely discard all 2xx responses, and we also discard all responses associated with health checks; by that I mean we do not log any of these kinds of responses. Service mesh proxies generate a large number of stats, and they can quickly overwhelm your metrics storage system, which basically means you will be spending a lot of money on it.
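To make that access-log filtering concrete: dropping 2xx and health-check responses might look roughly like this on an Envoy (v3 API) HTTP connection manager. This is a sketch of the idea, not Stripe's actual configuration:

```yaml
access_log:
  - name: envoy.access_loggers.stdout
    filter:
      and_filter:
        filters:
          - status_code_filter:            # only log responses with status >= 300
              comparison:
                op: GE
                value:
                  default_value: 300
                  runtime_key: access_log.min_status
          - not_health_check_filter: {}    # drop health-check traffic entirely
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
```

And on the stats side, Envoy's bootstrap lets you keep only the stats you have decided are meaningful. The prefixes here are hypothetical:

```yaml
stats_config:
  stats_matcher:
    inclusion_list:
      patterns:
        - prefix: "cluster.payments_cluster.upstream_rq"   # hypothetical cluster
        - prefix: "http.ingress_http.downstream_rq"
```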
These metrics can also increase cognitive load, because they come from disparate sources. So it becomes important to simplify observability, so that service owners can easily understand the state of the system using metrics. You can simplify metrics by filtering out irrelevant ones, or by combining metrics from different sources into a unified view, so that they are easier to understand and, at the same time, you do not store all of them, which would increase your costs. At Stripe, if we did not control the way metrics are generated, the service mesh could quickly consume a majority of our metrics budget.

Finally, let's have a look at how cross-AZ routing can increase your spend. Cloud vendors charge for traffic that crosses AZ borders, so you need to identify which services really route across AZs. This could also be just the control plane communication itself. Once you identify which services route across AZs, it becomes easier to decide whether that is really required or whether you can cut it down. This is one of the benefits of using a service mesh: you can now control services and their routing behavior so that they prefer local services over ones running remotely (there's a small sketch of the relevant Envoy knobs just before we wrap up). You need to take care, though, that if there is a local outage, your service is still up and running, that is, by preferring a remote service over a failing local service. For that, you may want to implement health checks, or you may want to employ outlier detection.

Let's now have a look at some open problems. The first one is troubleshooting. When you deploy your application on a service mesh, it becomes a little difficult to figure out where problems actually exist: whether the problem is with the application itself, with the network, or with the service mesh is hard to tell. Therefore, you may want to spend some effort on building tooling to aid troubleshooting. One thing that comes to mind here is to hand over the ownership of client libraries to the networking team. By doing that, the networking team can instrument these client libraries to aid troubleshooting. But at the same time, it becomes difficult to figure out where to draw the line between that team owning network interfaces versus business abstractions.

Another important aspect is developer education. What can we do to educate developers, meaning application developers, about service mesh behavior? Can service meshes be 100% transparent to service owners, or do we need to build higher-level abstractions or write extensive documentation in order to educate developers?

Finally, another important point is maintenance. How can we simplify upgrades? Envoy releases typically happen every three months, and there are CVE patches that come in between; deploying these to a large-scale system can be time-consuming. Also, when there is an API upgrade within Envoy, it can be difficult to migrate from one API to another. How can we minimize this friction? And hot restart may not be available across different Envoy versions, so how do we ensure that upgrades are smooth? These kinds of questions are still open.
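Before we wrap up, here is that sketch of the AZ-aware routing knobs, roughly what they look like on an Envoy (v3 API) cluster. The names and thresholds are illustrative, and zone-aware routing also needs locality information on the endpoints and a local cluster defined in the bootstrap, both elided here:

```yaml
clusters:
  - name: payments_cluster          # hypothetical upstream
    type: EDS                       # endpoint discovery config elided
    lb_policy: ROUND_ROBIN
    common_lb_config:
      zone_aware_lb_config:
        min_cluster_size: 6         # below this size, fall back to cross-AZ routing
    outlier_detection:
      consecutive_5xx: 5            # eject a host after five consecutive 5xx
      base_ejection_time: 30s
      max_ejection_percent: 50      # never eject more than half the hosts
```

Outlier detection is what lets you prefer local endpoints while still failing over to remote AZs when the local ones are unhealthy.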
In summary, we had a look at the promises of a service mesh, the costs of running one, some strategies for controlling those costs, and some open problems. And now a big question remains: is it worth it? We do think that a service mesh can solve interesting challenges for you, but you need to be very intentional about what features you require from it. You need to understand which features to enable, what their defaults are, what sort of logs and metrics those features can generate, and how much of your budget they can consume. With that, we would like to end the talk. Thank you. Stripe is hiring, so feel free to reach out to us if you are interested. And now we are open for questions.