Yeah, we're going to talk about Istio: where Istio is today, and where Istio is headed tomorrow. Super, super excited to be here with Mitch Connors. So if you allow me to quickly introduce myself. I got you. Thank you, Mitch. Yeah, my name is Lin Sun. I'm working on open source at Solo.io. Prior to joining Solo, I was a senior technical staff member at IBM. I wrote two books about Istio. One is Istio Explained. How many of you have actually read that book before? All right, a few of you. And I just published a new book, Istio Ambient Explained, so I'm very excited about that. I serve on the Istio Steering Committee and Technical Oversight Committee. Yeah, now I'm going to pass on to Mitch.

Hey, everybody. I think I've met a lot of you. My name is Mitch Connors. I'm a software engineer at Google. I joined the Istio project way back in 2018, and I'm very fortunate to serve on the TOC today. I hold the dubious distinction of having the largest handcrafted commit in any Istio repository. Not sure that's something I should be bragging about. In my defense, most of those lines are deletes rather than additions, so it's not as bad as it looks. I also had the privilege of being a guest on the Kubernetes Podcast. Check out episode 177 to hear about Istio just from a few months back, especially with the exciting announcement of the donation to the CNCF and a few other things.

So we said we were going to talk about Istio today and tomorrow, and I'm going to cover mostly the today section. I'm going to use the keyboard. How many of you guys are actively using a service mesh in production today? Oh, right. Wow, OK. And if you're not, how many of you are evaluating service mesh usage? Cool. And if you're not doing either of those, you're probably in the wrong room, but you're welcome to stick around. We're happy you're here. Service mesh usage is really, really taking off worldwide. And the CNCF survey from 2021 showed that there's a shift into production that we're seeing.
For the last five years or so, it was a very exciting topic. Everybody was talking about it, but it wasn't really making its way into production. So we've sort of crossed that gap and are moving on towards maturity in the industry as a whole, and Istio is no different.

We have a great community behind the Istio project that I'm very proud of. We have 397 community members, and that's growing every day. Actually, I wrote that last week, so we probably have a few more already. We've got 50 maintainers and seven working groups with 15 leads. In the last 28 days, there have been 435 developers who have contributed to Istio. Now, this is not just pull requests. It also counts if you've made a comment or opened a GitHub issue. If you've asked a question on any of our forums, we count that as activity in the community. So hopefully a lot of you are already counted in that 435, and for those of you who aren't, we welcome you to jump in, get involved, ask questions, answer questions. Let us know what's broken and what's working well. And those 435 developers were spread across 124 companies. You can, by the way, follow these stats on your own. CNCF collects them at devstats.cncf.io for all of their projects. So I'm really proud of the community that we've built.

And since, in particular, this is the maintainer track, I'll ask you: can you stand up if you're an Istio maintainer or have served as one in the past? All right, we've got a couple spread throughout. So our maintainers are hard at work, making sure that pull requests and issues are handled in a prompt manner, and making sure that new features, as they roll out, are stable and meet the bar of quality that we've set for you, our users. And we have a great team going on there.

So in April, we had IstioCon virtually, which was really exciting. Happy to not be virtual anymore, but we were thankful for what we had. And we laid out the roadmap for Istio along a few dimensions.
We said that we wanted to go for stability, particularly API stability. If you've noticed, there are a lot of APIs in the Istio ecosystem, the Istio CRDs, that are v1alpha or v1beta. We want to see those moving towards v1 because we want to be able to communicate to our users clearly that we expect these APIs to stay consistent. So there's a lot of pressure for maturity and promotion there. We wanted to make upgrades and troubleshooting easier. We heard loud and clear that day two operations for Istio are more expensive than they need to be, and so we're working on improving that experience. We'll talk a little bit more about that in a bit. Improved extensibility through the Wasm layer. We talked about expanding Istio's reach with IPv6 and ARM support and others, security hardening, upgrade automation. All in all, all of these things were focused around the central theme of improving day two operations.

If you're not familiar with the phrase day two operations: day zero is sort of when you're testing it out, where most people were with service mesh three to four years ago. Day one is installation and setup. We spent a lot of time honing that back in 2019, trying to make it better. We feel like it's a pretty good experience today. Day two is keeping the lights on: making sure that every time there's a CVE related to Istio or Envoy, our data plane, you're getting your deployments and your Istio control plane up to date and your Istio data planes re-injected so that you're protected against those CVEs. It's also troubleshooting when the mesh starts doing something that you didn't expect it to do. Our day two story is improving, and we're going to look at that in a little bit. It's still got a ways to go, and so that's still what we're working on today.

So let's take a look back and ask how we've done. In terms of API stability this year, 1.13 saw WorkloadGroup go to beta. And by the way, internally at the Istio project, we use beta as the signal that it's production ready.
If it's alpha, we would advise you not to deploy the API in production. It may disappear in the future. But beta has a support policy of, I think it's longer than three months. I think it might be a year. Anyways, it has an extended support policy, and we are very careful not to deprecate or modify beta APIs in breaking ways. That being said, we do still want to take these to v1. AuthZ dry run, which lets you see how your AuthZ policy is going to apply to traffic before hitting the go button and accidentally breaking some really important traffic, has graduated to alpha. 1.15, which launched in August, saw istioctl uninstall go to beta. It's now a little bit safer for you to just kick the tires on Istio. And then if you decide that it's not for you, you can use the uninstall command, and we will remove every trace that Istio was on your system. And then the 1.16 release is coming up, I think, next week. We'd better get back to work. It's coming up really soon. And you are going to see JWT, or JSON Web Token, based routing graduate to alpha, as well as external AuthZ graduating to beta so that you can run integrations with your own AuthZ systems.

Looking at how we've done on upgrades, we have upgrade surveys. If you've ever used istioctl install or the Helm chart to install Istio, there should be a little message right at the end of the script that outputs on your terminal and just says, hey, looks like you just upgraded Istio. Let us know how you did. And hundreds of you have. Or not how you did, how we did. Hundreds of you have clicked the link and let us know how we're doing on upgrades. We would love to hear from hundreds more. The overall story is that we're seeing consistent improvement from one release to the next in upgrade stability, which is a key metric that we're tracking for our day two operations. We're also still hearing very clearly that upgrades are still very hard, and that our users are still struggling with them.
So while we've seen a lot of improvement, we don't believe that we're done yet. I mentioned ARM support. We've seen a bill of materials added in 1.13 and future releases, and we're quickly working on getting towards SLSA Level 1 compliance for the project as a whole. We also have published an integration with Flux for upgrading your Istio installs automatically with the click of a pull request. So we're sort of reaching out and broadening our scope, working with partners in the ecosystem to make your lives easier. And I think that was it, right?

Oh, there was one more thing. We're here. This is Istio's first time at KubeCon as a... I have already clicked the button. I'm sorry, I'm stealing Lin's slides. This is our first time at KubeCon as an actual cloud native project. We've been lurking in the shadows for years. We've been submitting CFPs as an Envoy-based service mesh that would remain anonymous so that we wouldn't get kicked out for talking about the wrong project. But now we're here. We're an incubating project, and we hope to see graduation in the near future. So we're very, very excited about that. Oh, and because of that, you can go to the Istio booth in the pavilion, which I'm really excited about. We have maintainer track sessions, and we have multiple users sharing throughout the week about their experience using Istio, as well as, I think, a session on Ambient, I don't know, Gateway API with Rob Scott.

Yeah. You can get your free T-shirt at the Istio booth. Sorry, we forgot to bring any of the T-shirts here. And how many of you are excited about Istio being part of CNCF? Show your hands. Yeah, thank you so much, Mitch. That's a great overview of what we have done in the Istio community. I feel I learned a few things from you too.

All right, so now let's talk about the future of Istio. I want to start by talking about sidecars, right? As much as we love sidecars, there are some challenges with sidecars.
The first challenge from my perspective, actually from the community's perspective, I shouldn't say just my perspective, is about the transparency of the sidecars, right? If you've ever used sidecars, you probably remember you have to re-inject the sidecars, right? It's kind of a baby you have to carry with your application container, and you have to keep upgrading it whenever there's an Envoy CVE, right? So it's an operational burden on you to manage and upgrade that sidecar. And then if you ever have an init container, or if you have your own sidecars with your application, a lot of times there could be conflicts with our own init container or with our own sidecars. And you may have sequencing issues at startup between the application container and the sidecar. Also at shutdown time, there could be a sequencing issue. How many of you have had those issues? Yeah, some of you, for sure. We totally understand, right? That's why we were thinking about, you know, what about no sidecars? Also, if you're using a server-first protocol, or if you're using Kubernetes jobs, unfortunately Istio today doesn't support Kubernetes jobs or server-first protocols like MySQL, so that's another challenge with sidecars. In a nutshell, it's not transparent. Some of you raised your hands, so you've probably hit those surprises along the way, and it takes time to troubleshoot.

And the other thing about sidecars is, even though we do say you can adopt Istio incrementally, right? If you need mutual TLS, or if you need traffic shifting, or if you need telemetry, you can adopt Istio incrementally based on the feature you need. But with sidecars, it's yes or no, right? Regardless of whether you need one feature or three features, you have to run the sidecar because there's no incremental in-between. So you pay the price of the sidecar even though you just need mutual TLS. If you just need some basic layer 4 RBAC, you still have to get the sidecar running and operate it.
Yup, so that's the other challenge of sidecars. The last challenge I would highlight with sidecars is over-provisioning of resources, because you don't have a choice to say, I don't want a sidecar on my pod. Maybe you run 10 replicas, and you only want two of your application pods to run sidecars. For the rest, you don't want the sidecar. You don't have that choice. You have to drag it along for every single pod in your replica set, even though you could potentially have just two or three pods do the layer 7 processing job for you. So that's the other challenge with sidecars.

So how many of you have heard of Istio ambient service mesh? Wow, many of you. More than I expected, so good job, guys. So if you recall, on istio.io, on the blog page, we published three blogs between engineers from Solo and Google to talk about why we launched Istio ambient service mesh, how you get started with Istio ambient service mesh in a few minutes, and what the security trade-offs are between Istio ambient service mesh and sidecars. So that's super exciting. We launched that, I believe, on September 7th, and we tweeted that out. I actually tweeted for the Istio project. And what's also really reassuring to us is that Matt Klein, who is a founder of Envoy, really endorsed the architecture we put out for Istio regarding this new sidecarless architecture for ambient. So that's really, really cool.

So today I want to take you through a quick overview of what Istio ambient service mesh is. The time is going to be a little bit limited given the length of our session. I'll try to do my best. So essentially, in a nutshell, Istio ambient mesh introduces a two-layer approach, so you can have a multi-tenant, per-node secure overlay layer, which is provided by a component called ztunnel. So that provides, I'm sorry, I clicked the wrong button. That provides zero trust mutual TLS for you. That provides layer 4 policy enforcement for you. That provides layer 4 telemetry for you.
So that is multi-tenant, serving all the pods co-located on the particular node that ztunnel is running on. So with this architecture, you no longer need to run sidecars, so it's more transparent for your application. You don't have to restart your application pod. You can just include your application as part of the mesh through a label technique. So it reduces compute cost, it simplifies operations, and it's really transparent to your application.

In the two-layer approach, the second layer is the layer 7 processing layer, because we know Envoy was not designed to be multi-tenant. Envoy has issues when it's running multi-tenant, with noisy neighbors, with isolation, with cost attribution. So we want to make sure the layer 7 processing layer provided by Envoy is single-tenant. So you are typically going to see that in ambient, that's going to be per service account or per namespace, whichever tenancy model you feel comfortable with. We call that a waypoint proxy. It runs outside of your application pod and can provide all the layer 7 functionality you need. So yeah, that's essentially the ambient architecture.

What's really interesting is, Mitch talked about 1.16. One of the important features of 1.16 is that we actually added support in the sidecar today, so that a sidecar in 1.16 or newer can talk to pods in ambient through either ztunnel or a waypoint proxy. So they can interoperate, which is really, really cool. Because I believe maybe you have a strong use case where you want to continue with your sidecars, which is totally cool, or maybe you want some of your applications to gradually move to ambient, and then they can continue to talk to each other.

So as far as installing ambient, it's really easy. How many of you know Istio installation profiles? How many of you install Istio using a profile, right? Most people are using the default profile, or you customize your profile.
So ambient is an installation profile that we provide in the Istio project. It's an experimental profile at the moment because it's not production ready yet, but we do expect the ambient profile to become the default when it's production ready. So essentially you use istioctl to install the ambient service mesh. It's just one single command to get it installed, and it essentially installs the Istio CNI and the ztunnel, as DaemonSets running on every single node in your Kubernetes cluster, plus the Istio control plane and the Istio ingress gateway.

So let's talk about what you need to do if you want your application to be part of ambient, right? The only thing you need to do is label your namespace. For instance, you just label it with the dataplane mode set to ambient, and then every single pod in that namespace, without any change to your existing pods or any change to any of the new pods you are deploying to your namespace, is going to automatically be part of ambient. So that's really the transparency, the simplification of operations I was talking about earlier.

And then this is the two-layer architecture approach, where we have a secure transport layer that's provided by ztunnel, which is multi-tenant and per node, and that handles zero trust and layer 4 enforcement for you. And then we have the waypoint proxy, which by the way is optional. You only need the waypoint proxy if you need layer 7 processing. So when you find out some of your applications may need layer 7 processing, then you deploy the waypoint proxy, but it's optional, you don't have to. So this is really different from the sidecar approach. It's a more gradual, incremental adoption.

All right, so let's take a little bit more of a deep dive into the secure overlay layer. So what happens when application A is going to talk to application B, right?
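Before following the traffic path, here is a minimal Python sketch of the two onboarding steps just described: one install command and one namespace label. It only builds the command lines and does not run them. The helper names are hypothetical, and the exact profile flag and the `istio.io/dataplane-mode` label key are assumptions taken from the ambient getting-started material rather than from this talk, so check them against the Istio documentation for your release.

```python
import shlex

# Assumed label key; the talk only says "dataplane mode ambient".
AMBIENT_LABEL = "istio.io/dataplane-mode"


def install_command(profile: str = "ambient") -> list[str]:
    """Build (but do not run) the single istioctl install command."""
    return ["istioctl", "install", "--set", f"profile={profile}"]


def label_command(namespace: str) -> list[str]:
    """Build (but do not run) the kubectl command that opts a namespace in."""
    return ["kubectl", "label", "namespace", namespace, f"{AMBIENT_LABEL}=ambient"]


def is_ambient(namespace_labels: dict[str, str]) -> bool:
    """A namespace is part of ambient only if it carries the label."""
    return namespace_labels.get(AMBIENT_LABEL) == "ambient"


if __name__ == "__main__":
    print(shlex.join(install_command()))
    print(shlex.join(label_command("default")))
```

Note that nothing in this sketch touches a pod spec, which is the point being made: onboarding is a namespace-level label, not a restart of the workloads.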
So before application A attempts to talk to application B, the ztunnel here is automatically programmed as a certificate authority client to talk to the Istio control plane. So it has application A's certificate, to be able to act on behalf of application A. The ztunnel also serves as an xDS client, so it gets its configuration from the Istio control plane. So it knows which ztunnel it needs to go to for application B, and it also knows whether application B has a waypoint proxy.

So in this case, you're going to see application A sending traffic to application B. That traffic is going to be redirected to the ztunnel located on the same node, so that's plain text. And then once the traffic reaches the ztunnel, the ztunnel is going to check whether there is an existing tunnel between the service account that app A uses and the service account that app B uses. If there's an existing tunnel, it's going to try to reuse it. If there isn't one, it's going to create a new tunnel. This tunnel is what we call HBONE, an HTTP-based overlay encapsulation, which we're going to talk about a little bit more. But everything here is mutual TLS. It's very similar to what you see from sidecar to sidecar today, except that it goes over HBONE. And then when the traffic reaches the destination ztunnel, based on the original source and destination, it can forward the traffic to the destination, which is application B here.
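The reuse-or-create behavior described above can be sketched as a small pool keyed by the pair of service accounts. This is an illustrative model only, not how ztunnel is implemented; `TunnelKey`, `Tunnel`, and `TunnelPool` are hypothetical names, and the mutual TLS handshake is reduced to a counter.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TunnelKey:
    """Identifies a tunnel by source and destination service account."""
    source_sa: str
    dest_sa: str


@dataclass
class Tunnel:
    key: TunnelKey
    streams: int = 0  # number of requests multiplexed over this tunnel


@dataclass
class TunnelPool:
    tunnels: dict = field(default_factory=dict)
    created: int = 0  # how many tunnels (stand-in for mTLS handshakes) were opened

    def get(self, source_sa: str, dest_sa: str) -> Tunnel:
        """Reuse the existing tunnel for this identity pair, or create one."""
        key = TunnelKey(source_sa, dest_sa)
        tunnel = self.tunnels.get(key)
        if tunnel is None:
            tunnel = Tunnel(key)
            self.tunnels[key] = tunnel
            self.created += 1
        tunnel.streams += 1
        return tunnel


if __name__ == "__main__":
    pool = TunnelPool()
    pool.get("default/app-a", "default/app-b")
    pool.get("default/app-a", "default/app-b")  # reused, no new tunnel
    print(pool.created)
```

Keying on the identity pair rather than on the node pair is what keeps each tunnel tied to one source identity and one destination identity, even though a single ztunnel serves every pod on its node.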
Now, take the case where application B does have a waypoint proxy. Remember, when we talked about the waypoint proxy, it's optional, for when the app does need layer 7 processing. The ztunnel on the source side is smart enough to know, because the Istio control plane sends it the configuration, that the destination has a waypoint proxy. So it's going to send the traffic to the waypoint proxy first, through the HBONE tunnel, and the waypoint proxy forwards the traffic to the destination ztunnel. So it's just going to have an extra hop, and that extra hop does the layer 7 traffic shifting, the layer 7 policy enforcement, and the layer 7 telemetry collection. So it does a lot of layer 7 magic in that waypoint proxy.

So in a nutshell, the two layers we talked about are a key innovation in ambient. It separates layer 4 and layer 7 based on your business needs, so you can choose the secure overlay layer, which is primarily layer 4, and optionally you can add layer 7 processing as needed.

All right, so this is the HTTP-based overlay network. That's what is used when the ztunnel talks from source to destination, or when the ztunnel talks to the waypoint proxy. So what exactly is HBONE, right? All the traffic is tunneled through a single mutual TLS connection using an HTTP/2 CONNECT tunnel. And it actually fixes some of the problems we have in the sidecar world: it no longer requires the metadata exchange stack, it allows us to run Kubernetes jobs, and it fixes server-speaks-first protocols, which we had issues with in sidecars. So it's very, very nice. This is what a typical tunnel looks like, where you could have multiple requests flow through the tunnel. The tunnel is per source service account and target service account pair, and it uses port 15008 by default in Istio.
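As a rough illustration of the two paths just described, the sketch below computes the list of hops for a request depending on whether the destination has a waypoint proxy. It is a toy model under stated assumptions: `Workload` and `plan_path` are hypothetical names, and the real decision is made by ztunnel from xDS configuration sent by the control plane, not from a field on an object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    name: str
    node: str
    service_account: str
    waypoint: Optional[str] = None  # set only if L7 processing is wanted


def plan_path(src: Workload, dst: Workload) -> list[str]:
    """Return the hops a request takes from src to dst in this toy model."""
    hops = [f"ztunnel@{src.node}"]  # plain text from the app to the local ztunnel
    if dst.waypoint is not None:
        # Extra hop: L7 routing, policy, and telemetry happen here.
        hops.append(f"waypoint:{dst.waypoint}")
    hops.append(f"ztunnel@{dst.node}")  # reached over the HBONE mTLS tunnel
    hops.append(dst.name)
    return hops


if __name__ == "__main__":
    app_a = Workload("app-a", "node-1", "sa-a")
    app_b = Workload("app-b", "node-2", "sa-b")
    app_b_l7 = Workload("app-b", "node-2", "sa-b", waypoint="sa-b-waypoint")
    print(plan_path(app_a, app_b))
    print(plan_path(app_a, app_b_l7))
```

The only difference between the two outputs is the single waypoint hop, which mirrors the point in the talk that layer 7 processing is an optional addition on top of the layer 4 overlay.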
Let's talk about security, right? Because I'm sure you've been wondering, with Istio ambient service mesh, how does the security compare with sidecars, particularly if you're already comfortable with sidecars today? So one question I would ask is: out of all the Envoy CVEs, how many of you know how many of the CVEs are layer 4 related and how many are layer 7 related? Does anyone have a rough answer? All right, so I can tell you the answer. We did a rough study of the past two years. About a third of the CVEs are layer 4 related, and most of the other CVEs are layer 7 related, right? So what does this mean? If you just need layer 4 functions, running ztunnel is actually way more secure, because two thirds of the CVEs are not going to apply to you, right? So by running a minimized ztunnel compared with a full-blown sidecar, you really reduce your attack surface from that perspective. So that's really cool.

So you can think about the multi-tenant ztunnel this way. If that ztunnel is compromised, it's similar to every single sidecar on that node being compromised. And in theory, if you know how to compromise the ztunnel, you know how to compromise every single sidecar on that node. But the reverse is not true, because if you know how to compromise a sidecar, that doesn't mean you can compromise a ztunnel, because the ztunnel has a much reduced attack surface. So that's for the secure overlay layer.

Now, the layer 7 processing layer is designed to be single-tenant, so it's very similar to a sidecar. That's to eliminate all the noisy neighbor and cost attribution issues with multi-tenancy. That's why we don't do a node-level layer 7 Envoy. We in the community don't believe that's the right approach.

And all of you know we don't write perfect code, and we know that from the Istio project. So if you are an application developer, you could be pulling in libraries that have CVEs, or your own code may have CVEs. So in the sidecar world, when your application
container gets compromised, it could potentially have access to the sidecar because they are co-located, and potentially have access to the rest of the services and the Istio control plane. But in ambient, if your application container is compromised, the attack surface is much, much reduced, because you can't do much. You can't really compromise the control plane simply because your application is compromised. So that's really cool. So we believe, even though ambient is an experimental branch and it's new, that when it's production ready it's going to be as secure as sidecars, and potentially even more secure than sidecars.

With that, I think we have time, hopefully, to take some questions. This is a survey for the session, provided by CNCF. Certainly, if you point your phone at the QR code, we would love your feedback. And I think we have time for questions. All right, lots of questions. I think we have a roving mic, and we've got a question up front.

Yeah, thank you. So the roadmap for all of our features goes from experimental to alpha to beta, and we advise that our users not deploy to production until a minimum of beta for the feature. So we're actively working on getting ambient mesh out of the experimental phase and into alpha. Our internal goals are to have that relatively soon. I would say in H1 of 2023 we hope to have it available at least as an alpha, potentially even as a beta, in that time frame. For now, we'd love for you to kick the tires and let us know what you think. It is a work in progress, so there's a handful of things we know are broken, and probably some others that we don't know about, and it would be great if you could tell us.

Yeah, and also we have weekly ambient contributor meetings on Wednesdays. I actually run those meetings, and Mitch helps me run them too. So we would love to hear feedback from you guys and also have you guys contribute to ambient, right? Even if it's documentation or testing, you know, it's going to really help the project mature, and we want to partner with you.

Yeah, so you always
have to go through the ztunnel, because the way we have it is we have iptables redirects and routes, so all the incoming and outgoing traffic for any of the pods, once they are part of ambient, always needs to go through the co-located ztunnel. So the way to think about it is that the traffic between the sidecar and the application container is plain text, which is very similar to the traffic between your application container and the ztunnel, which is also plain text. So that surface area is the same. If somebody has the privileges to get onto the node and run something like network monitoring, they will be able to see that traffic, whether it's between your application container and the sidecar or between your application container and the ztunnel. Does that make sense?

That makes perfect sense, thank you.

Yeah, great question. And if you want the book...

Okay, thank you. My question was actually very similar. In what we had on the slides, there was one scenario with direct communication between ztunnel and ztunnel, where A wanted to talk to B, and another scenario where you had to go through a waypoint proxy. When would we use the waypoint proxy, and when would the communication happen between ztunnels without the waypoint proxy?

Okay, yeah, do you want me to take that one? So you're asking when is the waypoint proxy required and when is it not?
Yeah, yeah.

So the waypoint proxy's job is to be the policy enforcement point for L7. So if you need anything L7 from your service mesh, be it path-based routing or header-based routing or L7 telemetry or L7 authorization policies, then you need a waypoint proxy for that service, and none of those will operate until you've installed a waypoint proxy using the Gateway API.

Yeah, so basically you have the choice. You tell the Istio control plane, through the Gateway resources, to deploy the waypoint proxy if you need it, but it's optional and up to you to decide.

We won't run any proxies you don't want.

Yeah, which is cost saving, right? Why would you pay extra if you don't need it?

Yes, if you don't deploy a waypoint proxy, it'll be L4 only. You'll have all the encryption you need every time you leave the node, but none of your L7 stuff will work.

Yeah, you will get layer 4 telemetry though, even with just ztunnel, and you will get layer 4 policy enforcement, so you could have basic RBAC. Yeah, great question.

Do we? Yeah, in one of the slides you said the analytics and metadata exchange has... What is that?

I don't know, we'll extend that out. That's what we use for telemetry in Istio in the sidecar world. So if you install Istio, there are metadata filters installed for the telemetry that we use to collect metrics in sidecars. Yeah, do you want to add anything?
No, I think that covered it. Yeah.

Yeah, that's a great question. So they are two different projects, and we have two completely different teams of maintainers and developers. The folks at Open Service Mesh are developing an awesome Envoy-based service mesh. In that sense it is similar to Istio, but its emphasis is going to be different. You really shouldn't ask me too much about OSM. We're competitors. But I mean, the OSM team has great people here. You can talk to Keith Mattix at the OSM booth. I've answered and dodged the question at the same time.

Yeah, we could be biased. Go ahead. But I mean, the one thing I do recommend is to check out the CNCF survey, because the survey doesn't lie, right? So if you go to the survey, you can see what's the most popular service mesh and what's the service mesh most deployed in production. Yeah, so check those out.

In the proof of concept, that's correct. But with the new modes supported for DaemonSet rollout strategies, we believe that we can either eliminate the outage or minimize the amount of time that there's an outage. Something under half a second is kind of our target.

Yeah, and the other thing is you want to think about ztunnel like your CNI, right? When your CNI, whether it's Calico or Cilium, is being upgraded, it might cause a minor hit on the environment, right?
So you want to plan for your application to be highly available across different nodes too.

Also, if half a second is the wrong number, if you actually don't care that much and five seconds is no problem, or if half a second is way too long, please come to the Istio booth and tell us about your use case. This is the perfect time to have input into the development of ambient and how we steer the project.

So the labeling experience is consistent between the two. The label is different for ambient versus sidecar service mesh. The big difference is that when you're using sidecars, once you've labeled a namespace, all the pods that are running in that namespace are immutable, so we can't just add a sidecar next to them. We have to actually have you restart all of the workloads in that namespace, which causes them all to be recreated, and at that creation time we're able to inject a new sidecar into each of those new pods. With ambient, on the other hand, because we don't need a sidecar running in every pod, we're able to capture the traffic dynamically without restarting or interrupting any of the existing workloads.

Yes, you don't have to move everything onto ambient at the same time. You can have a mix of ambient, sidecar, and non-meshed services all communicating. Now, you can shoot yourself in the foot. If you turn on mTLS strict enforcement, all of those services that don't have a sidecar or ambient now just can't communicate with the other ones. But by default, you can communicate between them.

Yeah, so you can continue to stay on sidecars just because you feel more comfortable with sidecars. The project is not going to push you in one direction or the other. Certainly we recommend ambient when it's production ready, but you can gradually move whenever you feel comfortable. Yeah, there's no rush, and they can talk to each other too. So that's the 1.16 I was mentioning. We added HBONE support in Envoy, the Istio Envoy proxy, in 1.16. So in order for any pods in Istio with sidecars to talk to pods in
ambient, they would have to be on Istio 1.16 or later.

The 30-year-old was about 44 months back talking with Kinect. So to the end user, it's a little bit confusing, because beta is already, we are assuming, providers out there, but then when the basic premise is not in the uniform...

So I'll put on my user experience hat and say that sidecars or sidecarless, in themselves, are something you shouldn't actually care about. What you want is a mesh that's easy to onboard, that's easy to maintain and keep up to date, and that is computationally efficient and doesn't add too much latency. And oh, simple. It is also simple to use and configure. If you get those five things out of a service mesh, whether it's being implemented in sidecars or in ambient mode or inside the kernel or by camels carrying packets back and forth across the Sahara, you should not care. The bottom line is that what you care about is the features of your mesh. We believe ambient mesh and the sidecarless model will give you a better user experience than you have today with the sidecar model in Istio, and that's why we're moving there. But we don't want our users necessarily to be over-focused on the implementation details. We would like you to see the end results of why it's better.

Yeah, you get to pick, even though we believe ambient is better because it's non-intrusive, it's more transparent to you, and it has a better onboarding experience that we can provide. Yeah, but you guys are really going to let me know and let us know, maybe in six months. Yeah, or even now, if you are willing to try ambient in experimental mode.

Do we have time for more questions? So...
Yeah, so if you have a multi-tenancy model where you have separate workloads that you identify differently, but you're using the same service account for them, you have much bigger problems than your service mesh. You should have one service account per logical service. For things that could harm one another, if you have a Coke and Pepsi situation of multi-tenancy, you absolutely need to have separate service accounts for those, and then your waypoint proxies will also reflect that with those separate service accounts.

Yeah, that's exactly why we believe this is the right architecture, to allow you to still have single tenancy on layer 7 processing, because those workloads are typically pretty noisy, and in most scenarios you may want to extend it using something like Wasm, so it's good to be single-tenant.

So I think there were questions that we couldn't get to. We'll be at the Istio booth. I'll be there for the booth crawl tonight from 6 to 8. I'd love to have you stop by and have a chat about how you're using service mesh.

Yeah, and we'll be handing out our Istio ambient book, so if you want to read up on that, I'll be at the Solo booth and also the Istio booth too. Yeah, thank you so much for joining.