All right. Oh, hi everybody. I'm so excited to be back, and with Justin. So Justin, why don't you introduce yourself? Hi, I'm Justin. I work on Istio for Google. I've been working on it for about three years, and most of the work I've been doing for the last couple has been sort of in preparation for this ambient work. Prior to that, I used to work on a project called Open vSwitch and on OVN, so a little more in the SDN world than the service mesh. I've been enjoying what I've been seeing so far, though. Awesome. Yeah, I'm so excited, because Justin and I have been working on Istio ambient service mesh for at least six months. And even before that, in the Istio community, Justin, I believe you presented some ideas to the Istio Networking Workgroup, and that was over a year ago. So it's really cool to be up here together with Justin. I think some of you know me from early on, but just real quickly: I work for Solo, I'm one of the founding members of the Istio community, and I sit on the technical oversight and steering committees. And the most exciting thing about me is Istio ambient. I actually wrote a book about it, Istio Ambient Mesh Explained, with Christian. So if you ask really good questions during our session, you're going to get one of our books. All right. I'd like to start by talking about challenges with sidecars today. The biggest challenge, in my opinion, is transparency. Sidecars require injection, right? How many of you are running into startup or shutdown sequencing issues with sidecars today? Nobody? Somebody? There's got to be somebody. There are actually a lot of complaints we've heard from users in the Istio community, and I'm sure it's the same across different service mesh projects, too. And then, how many of you are not too happy that sidecar upgrades require restarting your application? Yeah, that's right. And how many of you ever wanted to run Kubernetes jobs, but found out that jobs aren't really supported with the sidecar?
That was a little bit frustrating, right? And what about server-first protocols, like MySQL? How many of you had the requirement to run those in a service mesh? Sorry, hopefully I didn't cause that. Yeah, so these are the challenges with sidecars: you always have to drag something along with you, and it has frequent upgrade needs for CVEs. The second challenge, I would say, is incremental adoption. Most of you adopt a service mesh for mutual TLS, but the sidecar doesn't offer a choice. You can't adopt just a fraction of the service mesh; you still have to carry the entire sidecar with your application container, even though you need just one single piece of the service mesh, which is typically mutual TLS. It's really an all-or-nothing model with the sidecar today. The third problem with the sidecar, in my opinion, is over-provisioning of resources. Let's say you have 10 replicas of your service. Each of your 10 replicas always has a sidecar along with it, even though you may only need three proxies to do the job your sidecars are doing, whether that's mutual TLS or traffic shifting or layer 7 processing. You don't have that choice, and you always have to pay the cost of the sidecar. With that, I'm going to pass it to Justin to talk about ambient service mesh. Yeah, so the plan with ambient was that we wanted to address those issues Lin was mentioning. The most important one, at least from my point of view, was not being disruptive to the applications. With the proxy being a sidecar, the way those sidecars are instantiated requires restarting the application. So you can't just hot-insert a service mesh into your workloads; you have to do something that's fairly disruptive. And because we're doing full L7 processing, there's also the chance that we can break some applications that have non-compliant HTTP stacks.
There is some stuff with the way that we do the mTLS upgrade in Istio that could cause some breakage, and we wanted to address those things. Also, when there are CVEs that need to be addressed in the proxy, upgrading requires restarting the workloads as well. So we wanted to be much less disruptive to applications. As Lin mentioned, currently in Istio with sidecars you have to over-provision for the proxies: for each workload, you need to have a proxy and allocate resources for the worst-case traffic likely to be seen on that particular application. That means you end up telling Kubernetes you need additional resources that you never use, so a lot of it just goes underutilized. We also wanted to make sure, though, that sidecars still have their place. There are a number of applications that require the security model or the approach that sidecars have, and so we wanted to make sure that whatever we did would continue to interoperate with sidecars. And as Lin was mentioning, the way Istio works today, if you wanted even just a feature like mTLS, you had to take all of Istio and all of its complexity: you needed sidecars, and you needed to restart your applications. We wanted a smooth path, so that the amount of work it took to get it installed was proportional to the complexity of what you wanted to do: a smooth upgrade path from something relatively simple like mTLS encryption to full Istio. So here's a picture of traditional sidecars. You can see that we have five workloads here, and for each of the workloads there is a sidecar proxy within the same pod. The approach we took with ambient is that we broke the proxy apart into two parts, and we're also now looking at separating L4 policies from L7 policies.
So at the lowest layer, we have the L4 policies, which we call the secure overlay. Instead of having the sidecars, we take a proxy that we're calling ztunnel and run it as a DaemonSet on the node. So instead of having one proxy per workload, we have one proxy per node, and that ztunnel is responsible for doing the mTLS connections between itself and the other ztunnels. In a moment I'll talk about how L7 works between the ztunnel and a waypoint proxy. Once we want to introduce L7, we still have the ztunnels, and nothing changes in the workloads themselves, but now we introduce what we call a waypoint proxy. That is a full L7 Envoy proxy that we deploy, and it does everything that you would normally expect Istio to do. And rather than making it all or nothing, we configure it per namespace or per service account. So let's say that you had these S1 and S2 workloads in one Istio namespace. If you had L7 policies for those, we would spin up a waypoint proxy and then redirect any traffic headed their way. In the previous drawing, we saw that C1 wanted to connect to S1, and it would just go over that tunnel. But now we tell any traffic that needs to reach S1 that it has to go through the waypoint proxy, and we redirect the traffic. Lin will get into a lot more of these details when she does the packet-eye view of tracing the traffic. So, as I mentioned, we're now consciously breaking Istio into two different layers. We have the secure overlay layer, which is just doing L4 processing. In that case, we can't do anything L7-related, but any L4 policy can be enforced in that ztunnel: TCP routing, TCP metrics, mTLS tunneling. But if you want things like authorization policies on URIs, that requires the waypoint proxy, and that gets deployed as the L7 processing layer.
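As a rough illustration of that L4/L7 split, here is what an L4-enforceable rule looks like in Istio's existing AuthorizationPolicy API. The namespace and service account names are made up for the example: source identity and port checks can be enforced at L4, while anything path- or method-based would need the waypoint's L7 processing.

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-client-l4
  namespace: demo                      # hypothetical namespace
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # Identity checks like this are L4-enforceable (mTLS peer identity)
        principals: ["cluster.local/ns/demo/sa/client"]
    to:
    - operation:
        ports: ["8080"]                # port matching is also L4
        # paths: ["/admin/*"]          # a field like this would require
        #                              # the L7 waypoint proxy instead
```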
Another thing that we introduced with ambient is something we're calling HBONE. Previously, when we did an mTLS connection in Istio, we had a sort of hacky method for upgrading every connection to mTLS that was fragile and could break some applications. So we have defined this new protocol that we're calling HBONE. It's really HTTP CONNECT with standard headers in the HTTP request, and then we use mTLS for the encryption. Previously, if you used Istio and you wanted to encrypt the traffic, and you had three connections on ports 80, 443, and 9080, each one of those would be a separate connection through the proxy. Instead, what we're doing is opening a single HTTP connection and tunneling all of the traffic through it, which cleans up a lot of the issues we were seeing with the previous method. It also makes things easier now that we have those waypoint proxies, because we've defined a standard tunneling mechanism that carries identity as traffic has to go out of its way, rather than just going from workload to workload. So the first question that people often ask about this is what the security model is, because obviously it looks a lot different from the sidecar, where before it was pretty easy to understand: the sidecar did everything, and it's just sitting next to the workload. But now we have multiple shared resources, so I wanted to briefly touch on this. We actually wrote a blog when we announced ambient mesh; it's available on the Istio website if you're interested, and it goes into a lot more detail about how security works. But there are three different levels that I think of. First is the way that the application interacts with the service mesh, and it's actually an advantage now not to have the workload co-located with the proxy.
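Circling back to HBONE for a second: since it is essentially HTTP CONNECT plus mTLS, a minimal sketch of the CONNECT framing it builds on looks like this. This is illustrative only; real HBONE runs over an mTLS-protected HTTP/2 connection and multiplexes many tunnels, and the address below is made up.

```python
# Conceptual sketch only: HBONE is, at its core, HTTP CONNECT tunneling.
# This shows how a single CONNECT request frames "open me a tunnel to
# this destination"; ztunnel then streams the raw TCP bytes through it.

def build_connect_request(host: str, port: int) -> bytes:
    """Frame an HTTP/1.1-style CONNECT request for a tunnel to host:port."""
    authority = f"{host}:{port}"
    return (
        f"CONNECT {authority} HTTP/1.1\r\n"
        f"Host: {authority}\r\n"
        "\r\n"
    ).encode()

# In real HBONE, the connections to ports 80, 443, and 9080 from the
# example would all ride inside one multiplexed upstream connection.
print(build_connect_request("10.0.0.5", 8080).decode().splitlines()[0])
```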
So previously, if there was a vulnerability in the application, that application could actually attack the proxy. And if the application was simply misbehaving: the current way we send traffic through the proxy is that we use iptables rules to redirect it, but it's pretty trivial for a misbehaving application to bypass the proxy entirely. So by moving the enforcement outside of the workload, we actually have better protection than we previously had. The second level is the ztunnels. Obviously, there is some concern people have when they first hear about this design, about the ztunnel holding the certificates of all of the workloads. But we mitigate that in a couple of different ways. First, when a workload starts up, the ztunnel notices that and requests the certificate for that workload from the CA. The CA authenticates the ztunnel and then ensures that the workload being requested is actually present on that node, so a ztunnel can't just ask for anybody's certificate. The other thing is that the ztunnel, as I mentioned, is only doing L4 processing, so it has a much smaller attack surface than the full Envoy proxies we were running before as sidecars. When we look at Envoy CVEs, most of them are related to L7 processing, which we are not doing at all in the ztunnels. And finally, there are the waypoint proxies, which are shared. But the way we share them is per service account or per namespace, so those workloads already have a common identity. So really, having the waypoints shared doesn't reduce security at all, because it's not really any different from the sidecar. With that, I'm going to pass it to Lin to walk through the packets. Okay, great. Thank you so much, Justin.
That was a great overview of Istio ambient service mesh. I'm really excited about the security stuff you were just talking about. So, a disclaimer: we don't have a live demo, because we actually have multiple live demos on YouTube. If anybody wants to see an ambient live demo, just go to the Istio channel. The packet walkthrough we're going to do will focus on the pieces relevant to Istio; I'm not going to dive into Kubernetes networking and how container networking works, just because we don't have enough time in this talk. As Justin was just mentioning, if you just need mutual TLS or layer 4 functionality, you basically only need a ztunnel. So what happens in this case? The client, which is app A in this diagram, attempts to send a request to app B. Then what happens? Sorry about that hiccup. This is a single point of failure. So, God bless. When app A attempts to send a request to app B, we have set up traffic redirection through iptables and routes, so the request is captured by the ztunnel that's co-located on the same node. The ztunnel then impersonates app A and sends the request through the HBONE tunnel Justin was just mentioning to the target ztunnel, and the target ztunnel is intelligent enough to forward the traffic to app B. So that's how it works at a high level. Now let's take a deeper dip into how it works. The first thing is that you need to include your application in the ambient service mesh, right? Otherwise we don't know whether you want your application to be part of ambient. The simple way to do that is a label on the namespace. For anything you want to be part of your ambient service mesh, you simply label that namespace with the ambient label, and that's how we know you want to be part of ambient.
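As a sketch of that labeling step: the exact label key has changed across Istio versions, but on the ambient experimental branch it looked roughly like this. The `demo` namespace name is just an example; check your release's docs for the current key.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                         # example namespace
  labels:
    # Opt this namespace into ambient; key as used on the experimental
    # branch around the time of this talk.
    istio.io/dataplane-mode: ambient
```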
Essentially, in the Istio config map there is an ambient mesh configuration, and by default that configuration is per-namespace. Eventually, when you feel comfortable with ambient and want to enable it for the entire mesh, you can enable it for the entire cluster; that's also available. But for now, let's do it on the namespace, as people explore ambient and transition to it. Another key component is the Istio CNI. We have extended the Istio CNI to work with ambient. What it does is check whether the application is part of ambient, based on the namespace label we just talked about. If it is part of ambient, we set up traffic redirection so that the ztunnel co-located with the application pod always captures all the incoming and outgoing traffic. It's similar to the iptables setup between the sidecar and the application container today, except we're doing it between the application pod and the ztunnel co-located on the same node. And if the app is not part of ambient, we essentially do nothing, so it behaves exactly the same as if it were not part of Istio or the ambient service mesh at all. So let's walk through how the ztunnel works. Before you even send any request from app A to app B, the moment you stand up your pod in that particular namespace that you labeled as ambient, the ztunnel acts as an xDS client. It tries to get its configuration from the Istio control plane. Essentially, the ztunnel sends a request to the Istio control plane on port 15012, and the control plane figures out: oh, you are the ztunnel on that particular node, and this is your specific xDS config, because a ztunnel on a different node would get a different configuration. Now, as you add applications to your namespace with this magic ambient label, the ztunnel also acts as a certificate authority client.
In this case, what the ztunnel does is say: hey, here's my service account token for the ztunnel; this application, application A, is on my node; can you give me the cert for application A so that I can impersonate it? The Istio control plane then does the check Justin was just mentioning: whether the pod for application A is actually co-located with the ztunnel sending the request. If so, it sends back a response saying: you are allowed to represent application A, and here are its certs. So essentially, the ztunnel on the node is multi-tenant: it can represent application A, or any other application that's part of ambient and co-located on the same node. The next thing we want to walk through is how the traffic redirection works. When the application does send a request from application A to application B on port 80, in this example, the connectivity between application A and the ztunnel is actually plain text, shown in green. The reason it's plain text is that we don't encrypt that hop at the moment; the ztunnel upgrades the connection to encrypted traffic through HBONE once the traffic hits the ztunnel. This is also what Justin mentioned earlier: it's very, very similar to how your application container works with the sidecar next to it, which is also plain-text traffic. At this point, the ztunnel for app A also has the right xDS configuration, so it can figure out the target, that is, which application B ztunnel it needs to reach out to.
The next thing we're going to talk about is what happens when the packet arrives on the source ztunnel. Essentially, the source ztunnel figures out the destination ztunnel based on the xDS config we just talked about. From there, it tries to figure out whether there is an existing HBONE tunnel for the application A and application B service account pair it's trying to reach, between this particular source ztunnel and the target ztunnel. If yes, it reuses the existing tunnel, to be more efficient; if there's no existing tunnel for the pair, it creates a new one. If you recall the HBONE slide that Justin talked about, this is the HBONE tunnel, and this is the logic deciding whether to reuse it. Now we've figured out where the destination is and whether to establish the HBONE tunnel, and the packet reaches the destination ztunnel. By the way, this traffic is encrypted with mutual TLS, because the source ztunnel presents app A's certificate, and the destination ztunnel, in a similar way, can impersonate application B: it also acts as an xDS client and a CA client, so it gets app B's certs and the right configuration. So what happens when the packet arrives at the destination ztunnel? The first thing it does is terminate the mutual TLS. Then it does a policy check. Justin talked about how you can do layer 4 policies, so it checks whether application A is allowed to call application B on that particular port number. If it's allowed, it forwards the packet to application B, as plain-text forwarding; if it's not allowed, it drops the packet. Justin also talked about the waypoint proxy. One of the key innovations in Istio ambient service mesh is the two-layer approach, right? The
layer 4 is multi-tenant with the ztunnel, and optionally you can add layer 7 through the waypoint proxy, only if you need that functionality. So in the case where there is a waypoint proxy, for instance because you need layer 7 processing, the source ztunnel is programmed automatically by the Istio control plane, through its xDS client and the configuration from the control plane, and it knows to send the traffic to the waypoint proxy. The waypoint proxy represents service B's service account in this example, and the waypoint proxy also gets application B's cert from the Istio control plane. The interesting thing about the waypoint proxy is that it doesn't represent more than one cert; it only represents application B's cert. It goes through the same CA flow with the Istio control plane that we talked about, and from there, the waypoint proxy also has xDS config from the Istio control plane, so it knows to forward the traffic on to the destination ztunnel. All right, so that's how it works in the basic scenario with source and target ztunnels, whether or not you have a waypoint proxy for your layer 7 processing. Now let's talk about how the sidecar works. In Istio 1.16 and master, we recently added support for HBONE in the sidecar, so shout out to John, in our audience, who did a lot of work on that. The reason we put HBONE support in the sidecar is interoperability: when you run sidecars with 1.16 and you use ambient, you can actually have them talk to each other. The ztunnel speaks HBONE, and the sidecar in 1.16 or newer can also speak HBONE, so the two can easily interoperate, which is really, really cool. The next thing I want to talk about is: what if the source is in ambient and application D is outside of the mesh? In this case, the Istio control plane automatically programs the source ztunnel to recognize that the
destination is outside of the mesh, and it sends plain-text traffic, because that's the only traffic application D can understand. The other scenario is: what if the source is outside of the mesh, application E here, and the destination, application B, is part of ambient and also has a waypoint proxy? The way it works is that application E sends the traffic to application B, and because of the traffic redirection set up by the CNI, the traffic is always captured on the destination ztunnel. The destination ztunnel then does an HBONE tunnel and forwards the traffic to the waypoint proxy for application B, if application B has one. So this goes through the HBONE encapsulation, using application B's cert on both the ztunnel and the waypoint proxy. All right, I think that's all the scenarios we have for the different packet views. I just want to quickly talk about the state of Istio ambient service mesh at the moment. It's an experimental branch in Istio, and we're working really, really hard in the community to merge it to master. John actually started a list of things that we need to finish before driving ambient to master so people can start to run it in production. So give us a little more time; we do expect merging to master to be a very high priority upstream. We actually have weekly meetings, with contributors from Google, Solo, AWS, Red Hat, and a couple of other companies participating, so we're really, really excited about that. We're also looking at optimizing the ztunnel. We had some challenges configuring the ztunnel to be multi-tenant with Envoy, because Envoy wasn't designed to be multi-tenant. So we're looking at the ztunnel, trying to standardize the API between the control plane and the data plane, and enabling you all to upgrade from the sidecar today, whether you're running it in your
production environment or not, to Istio ambient. And we highly recommend you all get involved in the Istio Slack; even if you're not a registered member, I believe there's an ambient channel in the Istio Slack, and the weekly contributor meeting. With that, I'd like to thank you all, and I would love to hear questions from the audience. All right, anyone have any questions? All right, I actually have two questions, if that's okay. So even though it's a layer 4 proxy, do you think it becomes a bottleneck for everything running on the node, and how do you scale it as the workloads are scaling up? And my second question is: we talked about reduced operational overhead, but it seems that the overall net complexity of the system... would you say it's comparable to the sidecar approach, or at the same order of magnitude, just delegated more to the node? For example, if we were trying to debug, would we have a harder time debugging or an easier time? So first of all, on the ztunnel: Lin had mentioned that we're looking at optimizing the ztunnel. Currently, the way it's implemented, we've taken Envoy and modified it. We have to create a lot more tunnels than Envoy is usually set up for, so the scale isn't great there. But we are looking at a Rust implementation that's looking quite promising: even with something like 20,000 tunnels, we're seeing it use less memory than Envoy. So we think there's work to be done, but we don't think scaling will actually be an issue in practice by the time we finish optimizing it. And then the second question, about the complexity: I would say you're right, I think it is a bit more complex than the previous model, but the sidecar has its own set of complexities when debugging. We will have to come up with a different way of debugging, so we have to look at introducing new debugging tools to make it easier to trace, for example, where packets are getting dropped, if something's going on in
the network, that sort of thing. So we do recognize it's something we need to work on to make it deployable. Yeah, can I ask: have you tried Istio ambient service mesh? Okay, well, the questions you asked were really good, which made me think you'd already tried it. I totally agree with what Justin said. The other thing you want to look at is the complexity with the sidecar: if you have an init container, if you have Kubernetes jobs, if you have server-first protocols, the transparency hurdle to get things working with the sidecar means you're actually paying a lot of complexity. And ambient is a baby at the moment, right? It's experimental, so give us time to work through the debuggability and the performance issues. That said, ambient is actually as performant as the sidecar even in its experimental stage, which is very impressive. How many years did we spend working on the sidecar? Five-plus years. And ambient is just six months plus. So yeah, great question. All right, is there another question? Hi there, that was a great talk. I wanted to understand a little bit about the resiliency of the ztunnels. How do you build in resiliency? What if the ztunnel has a failure, and there can be multiple reasons for failure? How do you build resiliency in such cases? Because I'm sure as more and more users come, the failures will keep coming as well, right? So yeah, I think one of the hopes is that the ztunnel, the way we're implementing it, will be much simpler than the sidecar proxies. You're right, there is going to be that single point of failure. We have other examples of that within Kubernetes, too, with the networking stacks. So it is something we will need to look at, but I think that the simplicity of what we're trying to do will create a much more resilient system than we've had in the past. Yeah, I would add that the way to think about the ztunnel is to think about
it as your CNI, right? What if your CNI is not running on the node? It's rare, but it could happen, right? It's also really important, when you design your application, to make sure it's highly available across different nodes. So yeah, just think about it as a CNI, part of the infrastructure. Hi, just thinking about layer 7 handling and that extra Envoy hop: is there already, or do you anticipate there being, something to, say, make sure that network hop stays within the zone, or any other optimizations that you anticipate? Yeah, that's a really good question; we've actually heard a lot of people asking for that. The way we've implemented it right now, and we haven't talked about how you deploy a waypoint proxy, is using the Kubernetes Gateway API. You basically tell us you need a waypoint proxy for your application, and then we automatically provision the waypoint proxy for you. The current API doesn't really allow you to optimize the placement of the waypoint, but that's something we're definitely looking into, to allow you to place it nicely in a particular zone or region that's co-located with your app, or whatever you need. Yeah, with the persistent HBONE connection, how does load balancing work? Are there multiple connections to endpoints, or does traffic just always flow to a single destination service? Do you want to take that?
Yeah, so the way it currently works is that there would be a new connection for each workload tuple, and right now, I think, they all share a single HBONE connection. I imagine it wouldn't be that difficult to do some sort of load balancing ztunnel-to-ztunnel. We are also looking at the ability to scale the waypoint proxies, in which case there would be some sort of L4 load balancing: if one of those enabled namespaces needed a number of waypoint proxies, then we would do the load balancing on the client side. Thank you, this was a very interesting talk. With moving traffic lower in the stack, have you had to make any concessions about surfacing metrics and observability? And if you have had to make concessions, do you have plans to implement those features in the future? Yeah, that's a great question; it's something we are actively working on. I believe layer 7 telemetry is in place in the experimental branch; a couple of us have actually tried it, so you can go to the Prometheus endpoint and get pretty much all the Istio metrics, like the HTTP metrics. That's all available today. The layer 4 telemetry is being actively worked on; you'll see folks on my team doing a lot of work on that, but it hasn't made it to the branch yet. We also have the Red Hat team from Kiali, who actually tried to put the services into the Kiali graph, and they've been raising some of the missing metrics. So that's definitely an area we'll look into next, to show up in the observability graphs in different tools. Yeah, I will add one thing, though, which is that in order to get the performance up, we tried to drop one of the Envoy hops. For example, previously you would have a sidecar on both sides of the connection; now what we want is an Envoy proxy only on the server side to enforce policies. That means some of the telemetry you might get, for request latency for example, is not going to be available, just because we're
doing a different model. We do have some thoughts about how we might be able to bring some of that back, but currently that is missing. I have to apologize for the delay; it's 10:56, and I'm the emcee for the next session, so really, apologies, I just didn't keep track of time very well. So I would say thank you all for joining us for this session. I hope you enjoyed it and learned something. And for the folks who asked good questions, come by and get your books here. So we'll see you next time.