Hello everyone, welcome. I'm John Howard. I'm a software engineer at Google, and I've been working on Istio for almost five years now. Throughout that time I've worked on a few different areas, but one of the things I've been most passionate about is the performance and scalability of Istio, so I'm really excited to be here talking about how we've architected ambient mesh for scalability.

Here's what we're going to go over today. First, a bit of an overview of what scalability is; it might seem obvious, but it's worth a quick review. We'll also talk about why scalability is challenging for a product like Istio, or even for similar networking products like Kubernetes. Then we'll cover how Istio is addressing these concerns in ambient mesh: in the waypoint component, in the ztunnel component, and in the control plane. If you have no idea what I'm talking about right now, don't worry, I'm going to cover what these are. Finally, we'll look at what it all adds up to, what kind of scalability you can expect, and some next steps.

When we talk about scalability, the way I view it is: how large can we go before we start hitting limits? "How large" spans a large number of factors: how many pods, how many nodes, how many clusters, how often those are changing, how large each of them is. Scalability is never a simple problem. We'd like to treat it as "I can have 100 pods, or 200 pods," but it's really a multi-dimensional problem. And there are many kinds of limits, too: the cost and complexity to manage, resource utilization, and the stability limits we hit at large enough scales, performance issues, things that start crashing, and so on. We want to be able to go as large as possible in all of these dimensions without hitting those limits.

In general, the goal I have for Istio is that Istio is not the bottleneck. We don't need to scale to a billion pods if no one is running a billion-pod cluster. But if someone wants to run, say, a 100,000-pod cluster, we don't want to be the thing blocking them from doing that.

Before we go on, I want to take a little poll of the audience. How many people here are running clusters with over 1,000 pods? Wow, quite a few people, actually. What about over 10,000 pods? Wow, that's more than I expected; we've got maybe five hands, and I suspect there are more who don't like to share. I also suspect there would be more hands if we asked around the broader KubeCon rather than here at Istio Day, because today we are probably not fully meeting that goal of Istio not being the bottleneck. That's why we're talking about the improvements we're making, so that hopefully we can achieve that goal in the future.

If we take a step back and ask why scalability is challenging for Istio: the broader internet's model looks something like this. Your laptop really only knows how to talk to your router, and your router probably only knows how to talk to your ISP. From there, the ISP knows the routing to get anywhere in the world, but it's not as if my phone knows the IP address of every device on the planet; that simply would not fit on my phone. Networking products like Istio, and even Kubernetes, are quite different. In Kubernetes, every node knows how to talk to every other node and every service. You have a full mesh.
Istio is the same way. I used a new diagramming tool and it gave me this mess, but I think it's fitting, because this is what it looks like: everything needs to know about everything. That is a classic n-squared problem, and it makes Istio hard to scale. Here's an Istio-specific example with four different workloads, so four different Envoy proxies, each color-coded to show they're distinct. You can see each one has configuration, these little circles, for all of the other workloads. Everything needs to know about everything: the classic does-not-scale n-squared problem.

So how do we fix that? Actually, before we get to fixing it, there are more problems. If you have n squared, you'd better hope n is small, so you can scale at least a little before you hit the cliff. In Istio, that's not the case: the config scales at this n-squared rate, and it's also massive to begin with. This is a dump of a single pod's configuration; you may or may not recognize something like this. It was taken from an empty cluster, and it scrolls for quite a while before you reach the bottom. If a small cluster already has that much configuration, you can imagine how poorly it scales at an n-squared rate. So those are the two problems we're going to tackle: the scaling rate, and the baseline size of the configuration.

But scalability is not just about how much configuration there is. Configuration is always changing: pods scale up and down, you edit your VirtualServices to do a canary rollout, things like that. So it's also about how we push those updates to the proxies. Any pod scaling up or down may require sending an update to every other pod in the entire mesh, which for many people here means 10,000 proxies or more, so it's really important that updates are efficient.

There are a few ways to go about this, in general but also specifically in Envoy. The first is what we call state of the world, which I'd consider the least efficient; as we go on we'll reach the most efficient. With state of the world, any time a resource changes, the entire state of the world, hence the name, is sent to the proxies. If I change my service A, I send the entire set of all services down to all the proxies as one big snapshot. A lot of redundant information gets sent repeatedly as you make changes.

An improvement on that is a delta, or incremental, approach: as configuration changes, we send only the piece that changed. This may sound perfect, like it solves all the problems, but you can still run into issues if the individual configuration objects are huge. Kubernetes ran into exactly this. Its watch mechanism is very similar to a delta model, and it had this Endpoints object, which lists all the pods for a single service; a service can easily have a thousand pods. Even though these objects were sent incrementally, each one was giant, so Kubernetes had to introduce a new resource, EndpointSlice, which splits them into chunks to make updates more efficient.
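To make the difference between those first two models concrete, here's a minimal sketch in Go. This is not Envoy's actual xDS API; the types and names are hypothetical, purely to show how much redundant data a state-of-the-world push re-sends compared to a delta push.

```go
package main

import "fmt"

// Service is a stand-in for one piece of proxy configuration
// (hypothetical; real xDS resources are far larger).
type Service struct {
	Name string
	VIP  string
}

// Proxy is a connected data-plane client.
type Proxy struct{ ID string }

func (p *Proxy) Receive(update []Service) {
	fmt.Printf("%s received %d resource(s)\n", p.ID, len(update))
}

// pushStateOfTheWorld re-sends every service to every proxy,
// even though only one service changed.
func pushStateOfTheWorld(all map[string]Service, proxies []*Proxy) {
	snapshot := make([]Service, 0, len(all))
	for _, s := range all {
		snapshot = append(snapshot, s)
	}
	for _, p := range proxies {
		p.Receive(snapshot) // O(all services) per proxy, per change
	}
}

// pushDelta sends only the service that changed.
func pushDelta(changed Service, proxies []*Proxy) {
	for _, p := range proxies {
		p.Receive([]Service{changed}) // one resource per proxy, per change
	}
}

func main() {
	all := map[string]Service{}
	for i := 0; i < 1000; i++ {
		name := fmt.Sprintf("svc-%d", i)
		all[name] = Service{Name: name, VIP: "10.0.0.1"}
	}
	proxies := []*Proxy{{ID: "proxy-a"}, {ID: "proxy-b"}}

	// One service changes; compare what each model transmits.
	all["svc-0"] = Service{Name: "svc-0", VIP: "10.0.0.2"}
	pushStateOfTheWorld(all, proxies) // 1000 resources to each proxy
	pushDelta(all["svc-0"], proxies)  // 1 resource to each proxy
}
```

With 1,000 services and one change, the state-of-the-world model transmits 1,000 resources to every connected proxy while the delta model transmits one; EndpointSlice applies the same idea one level down, splitting a single huge resource into smaller chunks.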
Lastly, an improvement you can make on the incremental approach is an on-demand approach. Rather than sending the entire set of resources and incrementally updating it, we can lazy-load resources, kind of like a serverless model, where things are loaded only on demand. If we look at where various products line up: most traditional proxies sit in the state-of-the-world camp; oftentimes, to change them at all you need a full restart of the proxy. Istio straddles the line between state of the world and delta; today we do a bit of both. Kubernetes, like I said, takes a more delta-like approach with its watches. The closest thing I could find to on-demand is DNS: when you make a request to a service, you do a DNS lookup at the moment of the request. The one difference I'd note is that proxy configuration is typically more of a pub/sub mechanism: once we start requesting a service, we keep receiving updates for it. DNS doesn't have that push mechanism, so it's slightly different.

Now that we have some background on what we're trying to solve and what our constraints are, I want to give a quick overview of ambient. If you haven't seen this slide yet, it will be the first of many times today and this week, I'm sure. This is the new architecture for ambient mode. The big difference, if you're not familiar with it, is that instead of sidecars we have two different components. The first is the ztunnel. This is a per-node component with a much smaller responsibility than a sidecar: its real job is to get traffic from point A to point B with secure encryption and policy. It does not provide the full suite of features Istio offers: no HTTP telemetry, routing, JWT authentication, or all the other things I'm sure we'll hear many people talk about. For those, we introduce the waypoint, which has all the same functionality as the sidecars or an ingress gateway, but is decoupled from the applications.

First, let's focus on how we made ztunnel scale. Like I said, ztunnel runs on each node, and it needs to know how to route to all the other nodes and all the other workloads. That is still an n-squared problem; we haven't eliminated it. However, ztunnel has a much smaller configuration surface because of its reduced responsibilities, so rather than attacking n-squared first, we targeted making the configuration much more efficient. Earlier I showed the giant Envoy configuration; this is what configuration looks like for ztunnel. For each pod in the cluster, ztunnel holds a small snippet like this. It's fairly small and easy to understand. The key point is not that Istio developers design better APIs than Envoy developers; that's not true. The issue is that Envoy exposes a generic configuration API meant for a huge range of use cases. If I want to tell Envoy to use mutual TLS, I first need to tell it to use TLS, then which CAs to trust, how to verify identities, which cipher suites to use, the TLS version, and that's just one example. The more opinionated we are about how Istio should behave, the more we have to spell out to Envoy, and this blows up quite a bit. With ztunnel, because we own both the Istio control plane and the Istio data plane, we can be far more opinionated in the configuration protocol.
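As an illustration of how small an opinionated record can be, here's a hedged sketch of a per-workload entry in Go. It's loosely inspired by ztunnel's workload model, but the field names are hypothetical, not the actual wire format; the point is that a single enum-like field can stand in for pages of generic TLS configuration.

```go
package main

import "fmt"

// Protocol is a single opinionated field: the control plane says
// *what* to do, and the meaning is compiled into the proxy.
type Protocol int

const (
	Plaintext Protocol = iota
	// MTLS means the CA trust, identity verification, cipher suites,
	// and TLS versions are built into the data plane itself, not
	// transmitted per workload on the wire.
	MTLS
)

// Workload is a hypothetical compact per-pod record. Compare this to
// the kilobytes of generic listener/cluster/TLS config a fully
// generic proxy API needs to express the same intent.
type Workload struct {
	Name     string
	Address  string
	Node     string
	Identity string // e.g. a SPIFFE-style identity
	Protocol Protocol
}

func main() {
	w := Workload{
		Name:     "productpage-v1-abc123",
		Address:  "10.244.1.7",
		Node:     "node-1",
		Identity: "spiffe://cluster.local/ns/default/sa/productpage",
		Protocol: MTLS, // one field instead of a full TLS context
	}
	fmt.Printf("%+v\n", w)
}
```

The trade-off is flexibility: a generic proxy must let you configure everything, while a purpose-built data plane can bake the decisions in.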
So here, instead of the giant set of things I'd need to tell Envoy to do mTLS, I can send ztunnel a single field that says "use protocol mTLS," and the details of what it means to do mTLS are built into ztunnel itself, so they don't need to be transmitted on the wire for every single workload every time there's an update. Will this alone scale to 10 million pods? No, it's still an n-squared problem. Strictly speaking it's not exactly n squared, because the proxies are per node instead of per workload, but your nodes scale with your workloads, so it's close enough. It does get us to a very large scale, though: we've done testing at 50,000 and 100,000 pods, and the footprint of ztunnel stays fairly small, around 30 to 60 megabytes of RAM.

While 100,000-pod Istio users may be few and far between today, we want to get to the point where people run their very largest clusters on Istio, so we wanted to future-proof this as well. As a baseline we have the efficient protocol, which is also incrementally updated; as I mentioned on the previous slide, it uses the delta approach. On top of that, we're working on an on-demand approach. Here's what it looks like: your workload sends a curl request to some service. The ztunnel receives that request and says, "I don't know what this service is," so it does an on-demand lookup to the control plane, and the control plane returns everything it knows about that service. Future requests don't need the lookup anymore, so they proceed without one. And the difference from DNS, as I mentioned earlier, is that the control plane can spontaneously push updates, so the information is always current. This on-demand approach unlocks, I don't want to say unlimited scalability, but it's not far off. We don't have to load the entire state of the world into every ztunnel: if a node is only calling one service, it only needs information about that one service. That keeps the footprint of ztunnel extremely low.
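Here's a minimal sketch of that on-demand flow in Go, again with hypothetical names rather than the real ztunnel or xDS interfaces: an unknown destination triggers a single lookup to the control plane, the result is cached, and the cache entry stays subscribed so the control plane can push updates later, which is the piece DNS lacks.

```go
package main

import (
	"fmt"
	"sync"
)

// ServiceInfo is a hypothetical stand-in for what the control plane
// knows about a service (endpoints, identity, protocol, ...).
type ServiceInfo struct {
	Name      string
	Endpoints []string
}

// ControlPlane handles lookups and subscriptions. In the real system
// this would be a streaming connection to istiod.
type ControlPlane struct {
	services    map[string]ServiceInfo
	subscribers map[string][]func(ServiceInfo)
}

// Lookup returns what the control plane knows and registers the
// caller for future pushes on that service.
func (cp *ControlPlane) Lookup(name string, onUpdate func(ServiceInfo)) (ServiceInfo, bool) {
	info, ok := cp.services[name]
	if ok {
		cp.subscribers[name] = append(cp.subscribers[name], onUpdate)
	}
	return info, ok
}

// Update simulates a spontaneous push: the part DNS doesn't have.
func (cp *ControlPlane) Update(info ServiceInfo) {
	cp.services[info.Name] = info
	for _, notify := range cp.subscribers[info.Name] {
		notify(info)
	}
}

// NodeProxy is the ztunnel-like side: a small cache, filled lazily.
type NodeProxy struct {
	mu    sync.Mutex
	cache map[string]ServiceInfo
	cp    *ControlPlane
}

// Route resolves a destination, hitting the control plane only on a
// cache miss. Subsequent requests are served from the local cache.
func (p *NodeProxy) Route(service string) ([]string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if info, ok := p.cache[service]; ok {
		return info.Endpoints, nil // warm path: no lookup needed
	}
	info, ok := p.cp.Lookup(service, func(updated ServiceInfo) {
		p.mu.Lock()
		defer p.mu.Unlock()
		p.cache[updated.Name] = updated // pushed update keeps us fresh
	})
	if !ok {
		return nil, fmt.Errorf("unknown service %q", service)
	}
	p.cache[service] = info
	return info.Endpoints, nil
}

func main() {
	cp := &ControlPlane{
		services:    map[string]ServiceInfo{"reviews": {Name: "reviews", Endpoints: []string{"10.0.0.5"}}},
		subscribers: map[string][]func(ServiceInfo){},
	}
	proxy := &NodeProxy{cache: map[string]ServiceInfo{}, cp: cp}

	eps, _ := proxy.Route("reviews") // miss: one control-plane lookup
	fmt.Println(eps)
	cp.Update(ServiceInfo{Name: "reviews", Endpoints: []string{"10.0.0.6"}})
	eps, _ = proxy.Route("reviews") // hit: served from cache, already updated
	fmt.Println(eps)
}
```

A node that only ever calls one service caches one entry, which is why the footprint can stay so low.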
Another aspect of ztunnel is that it's purpose-built. I mentioned this for the protocol, but it applies everywhere in ztunnel. We built it from the ground up to serve this specific purpose, and at every step, the decisions we made were the ones that took us closer to the architecture and requirements we needed. It's hard to put "we built it intentionally" on a slide. One aspect is that it's written in Rust. Rust is not why it's scalable, but they have a nice logo, so I put it up here. To give one concrete example: Envoy has a model where very little is shared between threads. Connection pools, for instance, are not shared, so if I'm running three threads in Envoy and I open three connections, they probably won't be shared at all: three mTLS handshakes, three connections held open. And that's not really something we could change in Envoy if we wanted to, or only with a lot of work. In ztunnel, we decided we wanted a shared threading model instead. Because we built it ourselves, we were able to look at that choice, evaluate it, and make the right call for our requirements.

These slides I stole shamelessly from Cloudflare. They had a blog post about exactly the same thing: moving from NGINX, which has a threading model similar to Envoy's, to Pingora, a proxy they built in-house for very similar reasons, which also happens to be written in Rust. If you want to learn more, check out their blog; it's pretty good, basically this exact talk from Cloudflare's perspective instead of Istio's. We're not alone in the industry on these trends. So, all together, I'm pretty happy with where ztunnel's scaling is: even without on-demand, we can easily scale to very large clusters, and we have the on-demand mode in our back pocket for when we need to reach hyperscale.

Next I want to move on to waypoints, which have very different properties from ztunnel. Waypoints run per namespace; I like to call them the gateway to the namespace. They're Envoy-based, and they have the full set of functionality that Istio's sidecars or ingress gateways have, and with that comes a lot of configuration. Some of it is strictly required, because we're configuring all that functionality, and a little of it is bloat because it's Envoy, but a lot of it is necessary. Like I said, it's a complex configuration surface. The key point, though, is that because these proxies only send traffic to one namespace, where the configuration needs to go is very scoped.

Bring back that old sidecar model and we have the n-squared problem again. Waypoint proxies are fundamentally different from sidecars. With sidecars, the clients hold all the routing information: each client sidecar needs to know how to reach every destination. With waypoints, clients only need to know how to reach the waypoint for a namespace, and the namespace's waypoint only needs to know how to reach workloads within that namespace. We've effectively moved the client-side routing behavior to the server side and limited its scope. In terms of how much configuration exists, and this is obviously a huge approximation, on the left we have sixteen pieces of configuration and on the right we have four. We've fundamentally broken the n-squared scaling problem. Ramp this up just a little, say 25 services with 10 pods each across two namespaces, and programming the waypoints takes about 1% of the configuration that programming sidecar Envoys does.

So we took two different approaches. With ztunnel, we were stuck with the n-squared requirement, so we had to make it extremely efficient. With waypoints, we can tolerate some inefficiency, because the scope is fundamentally different. And that's not something you can just optimize after the fact: we designed ambient around scaling like this from the start.
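To see why that breaks the n-squared curve, here's a back-of-the-envelope calculation in Go. The counting model is my own simplification, one "entry" per destination workload a proxy must know about, not an official Istio metric, but it lands in the same ballpark as the slide's roughly-1% figure.

```go
package main

import "fmt"

func main() {
	// Assumed cluster shape from the talk's example.
	services := 25
	podsPerService := 10
	namespaces := 2

	workloads := services * podsPerService // 250 pods

	// Sidecar model: every client sidecar carries routing entries
	// for every destination workload in the mesh -> n^2.
	sidecarEntries := workloads * workloads

	// Waypoint model: each namespace's waypoint carries entries only
	// for its own namespace's workloads -> roughly linear.
	perNamespace := workloads / namespaces
	waypointEntries := namespaces * perNamespace

	fmt.Println("sidecar entries: ", sidecarEntries)  // 62500
	fmt.Println("waypoint entries:", waypointEntries) // 250
	fmt.Printf("waypoints need ~%.1f%% of the sidecar config\n",
		100*float64(waypointEntries)/float64(sidecarEntries))
}
```

Under this toy counting it comes out below 1%; the exact ratio depends on what you count, but the shape, quadratic versus linear, is the point.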
Last, I'll go briefly over the control plane. We've talked about these incremental updates, and one of the challenges with them is figuring out what actually changed, so that we're not pushing redundant changes, or even worse, failing to push an update when we should have and serving stale data. It turns out this is genuinely hard to do correctly, so we've been working on a new framework that helps do it automatically, so developers don't have to do it the manual way, which leads to a lot of errors. Unfortunately, I don't have much time to talk about it here; fortunately, I have a whole talk dedicated to it tomorrow. If you're interested, come find me, in some room that I didn't put on the slide, and check out the talk "Building Better Controllers" tomorrow.

All right. If we look at this holistically, what's the actual benefit? Are we talking 1% savings? What's happening? With any scaling or benchmark statistics you can lie quite a bit, so I like to lead with the extremely deceptive numbers. Full caveat: this is not realistic. But if we compare a giant cluster running ztunnels versus sidecars, and this is a real test I ran, not just extrapolated numbers, each Envoy uses four gigabytes of RAM to hold all the configuration, while each ztunnel needs only about 75 megabytes. Totaled across the cluster, the sidecars take 150 terabytes of RAM while the ztunnels need only about 12 gigabytes. That's roughly a 13,000x improvement: in monthly RAM cost, it's like buying a dinner versus buying a house.
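As a sanity check on those deliberately deceptive numbers, here's the arithmetic as a tiny Go program. The per-proxy figures (4 GB per sidecar, 75 MB per ztunnel) are from the test above; the cluster shape I plug in, the pod and node counts, is my own assumption, chosen so the totals land near the quoted 150 TB versus roughly 12 GB.

```go
package main

import "fmt"

func main() {
	// Per-proxy figures from the talk's (deliberately unfair) test.
	sidecarGB := 4.0   // GB of RAM per sidecar Envoy
	ztunnelGB := 0.075 // 75 MB of RAM per node ztunnel

	// Hypothetical cluster shape (my assumption, not stated in the
	// talk): one sidecar per pod, one ztunnel per node.
	pods := 37_500
	nodes := 160

	sidecarTotalGB := float64(pods) * sidecarGB  // 150,000 GB = 150 TB
	ztunnelTotalGB := float64(nodes) * ztunnelGB // 12 GB

	fmt.Printf("sidecars: %.0f TB\n", sidecarTotalGB/1000)
	fmt.Printf("ztunnels: %.0f GB\n", ztunnelTotalGB)
	fmt.Printf("ratio: ~%.0fx\n", sidecarTotalGB/ztunnelTotalGB) // ~12,500x
}
```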
Now, this isn't quite accurate, right? No one in production is running sidecars that take four gigabytes of RAM. Istio has a lot of knobs for optimizing the resource footprint of sidecars; however, you have to go turn them. Ztunnel doesn't require any of that configuration: it's scalable out of the box. The other thing that makes this wildly unfair is that we're comparing ztunnel to sidecars. Sidecars are a fully functional service mesh proxy, while ztunnel has quite a small scope. Then again, if you're only using the small scope ztunnel provides, maybe it is a fair comparison: maybe you're paying for more resources than you need, because sidecars give you more functionality than you use. One of the nice things about ambient is that you can pick and choose. For some services, maybe most services, I only need mTLS, so I only need the ztunnel, which is very cheap; for other services I want rich HTTP functionality, so I enable a waypoint for those ones only. You don't really have that option with sidecars: it's all or nothing, and you really need the entire cluster in the mesh to get the full benefits. With ambient, you get to pick and choose how you scale.

The deceptive numbers are fun, but a more realistic model looks like this. I believe this was a test we ran with a bunch of the demo apps, Bookinfo and some others, with a couple of replicas each: about 160 pods, and probably 20 or so services. The Envoys were using about 40 megabytes of RAM each, a pretty reasonable, standard number, totaling six gigabytes. The ztunnels are quite small: three of them, only 15 megabytes. The waypoints, like I said, have a configuration surface very similar to a sidecar's, so at this small scale they look a lot like sidecars, but there are fewer of them, because they're per namespace rather than one per workload. Overall, in most tests comparing sidecar cost to ambient, we're seeing somewhere in the range of 90 to 99% resource savings. Is that the fun house-versus-dinner comparison? Probably not, but 90 to 99% is nothing to scoff at.

So if we move beyond where we are today: 99% is nice, but can we do better? What about 100%? The architecture today looks like this: the per-node ztunnel and the per-namespace waypoint proxies. But notice that we have two node-level networking components here, and you might actually have three: ztunnel, kube-proxy, and your network policy CNI, like Calico or Cilium or something else. All of these have heavily overlapping responsibilities: a lot of the same features, needing a lot of the same configuration. So when we send all this data about all these pods and workloads to ztunnel, we may be sending the same data to another pod right next to it on the same node.

Where I see the future is in collapsing some of these layers: building the functionality of ztunnel into what I tend to call enhanced CNIs. That way the overhead of ztunnel is amortized into the existing node networking cost; the configuration overhead becomes essentially zero, because the configuration is already on each node. We just need to upgrade those products with the capabilities ztunnel provides. I'll caveat that I'm talking about the resource footprint of the configuration here. Ztunnel also adds TLS, and TLS is not free, but it has an extensive history of optimization and cost management, and in most cases it would be replacing lower-level network security like WireGuard or IPsec, which has similar, if not higher, costs than mTLS. The other new thing about waypoints is that they don't need to run in the cluster. They can run outside the cluster if that's easier to manage, or potentially in a managed load balancer if you have one. They can be shared in ways sidecars never could; we have far more flexibility in how we deploy waypoints, so there are plenty of opportunities there to reduce cost as well.

Overall, I think that while ambient mesh today is a huge step forward in scaling, we can go even further and make Istio an obvious choice. You shouldn't need to ask whether it's worth the cost; of course it's worth it, because the cost is almost zero. Then you can incrementally add features and pay only for the ones you use. You don't have to take on the full cost of Istio just to get mTLS: you can start there, gradually add more functionality for the important services that need it, and slowly expand your mesh. I'm hoping that with all these changes combined, we'll see Istio deployed at much greater scales. Next time I poll you all, I'm not starting at a thousand pods, I'm starting at a hundred thousand, and I want to see some hands next year. Thank you, everyone.