 All right, good morning, everyone. There is nothing to get a speaker's blood pumping like the audio-visual gods frowning on you before you are the first speaker on the stage for the day. So my adrenaline's going, and we have about 37 minutes left, so let's get rolling. Good day. Welcome to Open Source Summit. Today we are going to talk about how to simplify the operations of a service mesh using the new Istio Ambient Mesh capability. My name is Jim Barton. I am a field engineer with Solo here in North America. My career in the enterprise community spans three decades. I've been with Solo for nearly three years now, and prior to Solo, I was an enterprise software architect at companies like Amazon and Red Hat and Zappos.com. Just very quickly, who is Solo.io? Solo is a company that was born in the cloud that specializes in helping enterprises manage the complexities of application networking in a cloud native context. We do this via community leadership in strategic open source projects like the Istio service mesh, Envoy Proxy, and GraphQL, and also by offering enterprise-grade products and services based on those projects. So I want to set the stage for this discussion with a story that takes us back to an event that occurred just before this photo was taken at my engineering school graduation over three decades ago. If you look very carefully, you can see that I had a lot more hair in those days, and I suppose all of this is a result of 30 years spent in the enterprise computing space. So take that as you will. Let me start off by showing you a little video clip here. We'll just sort of run this in the background. The story that I'm going to tell you arises from a competition in which my university's IEEE student chapter competed against other schools in our region. It was an autonomous robot car racing tournament, very much like this video here. Each car had to trace a thin black line around a course with a lot of twists and turns. 
And the winner of each head-to-head heat moved forward in the competition until you got to the finals. Let me tell you, this was a high-energy competition. There were entries from several engineering schools within our area, and there were some really interesting designs. Some of them were pretty complex. Some of them were honestly kind of funny. But there were two notable vehicles that worked their way through the single-elimination bracket. One of them was this huge behemoth of a vehicle. It had this giant cylindrical core that was surrounded by a three-legged superstructure that rotated. Each of the three legs was supported by its own wheel, and the legs were packed full of D-cell batteries to power this contraption. It teetered right on the edge of the size and weight constraints that had been established for the competition, and it required two college students just to physically carry it to the starting line to begin the race. It was quite a vehicle to behold. This vehicle came from a large, very prominent engineering school in our region, and it came with its own massive student delegation that thought they were at a football game. And so every round they would win, these massive cheers would go up from their cheering section. But then on the other side of the bracket, there was this other entry that kind of worked its way through as well. And from a design standpoint, it was the polar opposite of the behemoth that I just described. It was a tiny little device from a small engineering school that sent only a single delegate to the conference. He utilized what appeared to be the chassis of a child's slot car, and there was this little sweeping photodetector at the front of the car. The whole thing was driven by a little AA battery that appeared to be wired into the chassis. And surprisingly, it started defeating its much larger, much more well-funded competitors. 
So after an hour or so, we reached the tournament finals, and there were these two vehicles left standing: the massive behemoth with the rotating legs and the giant cheering section versus the tiny, single-designer slot car. So who won the coveted championship trophy in the tournament finals? Come see me after the talk and I'll... no, I won't do that, I won't do that to you. But do stay around until the end of the talk and I will tell you how the tournament finals worked out. All right, so why do I tell that story? Well, I learned a lot of things in my senior year of engineering school. I'm sure I've forgotten most of the specifics by this point, but this particular event seared into my consciousness a lesson that has served me well over the intervening three decades, and that is the principle of parsimony. Parsimony is quite simply the law of simplicity. When you have a design goal in engineering and there are multiple paths to reach that objective, all else being equal, the simplest path, the one with the fewest moving parts, typically represents the best option. In short, simplicity trumps complexity. So as we transition into discussing the future of service mesh technology, here's the way I see this whole marketplace trending right now. It's all about simplifying: simplifying architectures, simplifying adoption, simplifying operations. There have been some really exciting developments along those lines just in the past few months, and we'll be exploring those together over the course of this talk. But to set up that discussion, let's go back to the beginning. Let's say you're living in prehistoric times, circa 2017, and you are a financial institution or perhaps a government agency, maybe part of an enterprise in some heavily regulated industry like telecommunications, and you have two services that need to communicate with one another, service A and service B. 
And in that prehistoric time before the dawn of modern service meshes, you have a regulatory requirement that communication between all of your services, even your internal ones, all inter-service communication, must be encrypted using mTLS. So how would you satisfy that requirement? Well, in those days, maybe you'd Google TLS libraries for the languages or frameworks that you're using, you'd integrate those into your application stacks, and then you'd test it and deploy it and you would mark the requirement as satisfied. And that approach is fine as far as it goes, but for the people in this room, we don't have the luxury of living in such a simple world, do we? No, we don't. Instead, we operate in service ecosystems that often look much more like this. I spent a number of years at Amazon, and one of the posters we had in our office depicted what we affectionately called the Death Star. It represented a snapshot at some random point in time of all the services in the internal Amazon application network and all of their interconnections. So the simple approach we envisioned on our previous slide, individual services building out their own unique solutions for observability and security and connectivity, just doesn't scale well at all. These challenges caused our community to ask a number of important questions about how we could operate vast microservice networks at scale. How might we secure service-to-service communication? How might we manage resilience concerns like timeouts and communication failures? How can we control and route traffic? How might we publish consistent metrics and observe interactions among services? And how might we do all of this in a simple and repeatable way that embraces important principles like infrastructure as code and declarative configuration? So we realized as a community that a lot of these concerns were cross-cutting, right? They applied to all services within our networks. 
And so rather than taking on all of this undifferentiated heavy lifting within each application team, wouldn't it be better if we could externalize those capabilities and handle those cross-cutting concerns in some external component? So the first data plane architecture that emerged from that era in most service meshes utilized a technique called sidecar injection. You see that depicted on the after side of the slide on the right. A sidecar container lives next to the application workload and handles all of those cross-cutting concerns like connecting, securing, and observing on behalf of the service to which it's attached. So we could then remove a lot of these undifferentiated heavy lifting burdens from our application teams, and capabilities like the ones enumerated on this slide could be handled more effectively using service mesh infrastructure. We won't walk through all these details here, but as you can see, the key pillars fall into the categories of connecting, securing, and observing. So where the service mesh community landed from an architectural standpoint, and this is true whether we're talking about Istio or any of the other service mesh alternatives that are out there today, is an architecture that has basically two components: a data plane and a control plane. The data plane is responsible for request processing and routing. What happens is that an Envoy proxy is injected into the service pods as a sidecar, and this sidecar is responsible for all those cross-cutting connect, secure, and observe capabilities on behalf of the application service. So when service A needs to talk to service B, the Envoy sidecars intercept all of that inter-service communication and manage things like securing the communication channel, managing request timeouts and failure policies, publishing metrics, and so forth. 
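To make the "declarative policies" idea concrete, here is a minimal sketch of the kind of timeout-and-retry policy a sidecar would enforce. The service name and threshold values are illustrative, not taken from the talk:

```yaml
# Sketch of a declarative resilience policy enforced by the sidecars.
# The service name "recommendation" and the timeout values are illustrative.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: recommendation
spec:
  hosts:
  - recommendation              # target Kubernetes service
  http:
  - timeout: 2s                 # overall request deadline
    retries:
      attempts: 3               # retry failed calls up to 3 times
      perTryTimeout: 500ms
      retryOn: 5xx,connect-failure
    route:
    - destination:
        host: recommendation
```

Applying a resource like this is all an application team has to do; no service code changes, which is exactly the point of pushing these concerns into the mesh.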
And so if you have a mandate for securing inter-service communication across a vast application network, then by injecting these sidecars and applying some pretty simple declarative policies, you can achieve that objective within your service mesh without requiring modification to any of the underlying application workloads. And so that's the data plane. But the data plane can do nothing without a control plane to specify policies that tell it precisely what to do. The control plane is the brains of the operation. When outside stimuli are applied to the mesh, let's say a new service failover policy is being applied, then the control plane is responsible for ingesting that new policy, translating it into configuration for all the applicable Envoy sidecars, and then pushing that configuration out to those sidecars throughout the mesh. At a high level, that's been the general state of service mesh architecture for some time now. And let's be clear here: there are many enterprises who have been and are now successful in utilizing this architecture. It does solve a lot of problems in managing complex microservice networks without requiring changes to underlying service code. But let's be honest, there are operational challenges that have emerged as we've gained experience with service meshes. We've learned a lot over the past five years since Istio went GA with its 1.0 release back in 2018. This is a great quote from a senior engineer at T-Mobile named Joe Searcy. He spearheaded a lot of T-Mobile's early service mesh adoption, and he says this: the biggest enemy of service mesh adoption has always been complexity. The resource and operational overhead to manage the service mesh for a large enterprise makes adoption cumbersome, even as projects like Istio have worked to decrease complexity. 
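The "service failover policy" mentioned here could take the form of an outlier-detection rule that ejects unhealthy endpoints so traffic fails over to healthy replicas. This is a hedged sketch with illustrative names and thresholds, not the exact policy from the talk:

```yaml
# Illustrative failover-style policy: eject unhealthy endpoints from the
# load-balancing pool so traffic shifts to healthy replicas.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: recommendation-failover
spec:
  host: recommendation            # illustrative service name
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5     # eject after 5 consecutive 5xx responses
      interval: 30s               # how often endpoints are analyzed
      baseEjectionTime: 60s       # minimum ejection duration
      maxEjectionPercent: 50      # never eject more than half the pool
```

When a policy like this is applied, the control plane ingests it, translates it, and pushes the resulting configuration to every applicable sidecar, which is the flow just described.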
So while Istio does what it does very well, there are certainly opportunities that we have learned as a community to simplify its operation and lower its overall cost of ownership. So let's explore some of the dimensions of this complexity that the community is seeking to simplify, and we'll follow that by delving into some of the emerging innovations in the community that address these challenges. So first off, number one, let's take a look at operational complexity. This diagram depicts a typical Istio deployment, obviously reduced to a much smaller scale than usual. You can see that we have three nodes deployed in our Kubernetes cluster. There are a number of apps deployed, some of them with multiple instances deployed for enhanced availability and perhaps performance. Because we're using Istio, we have an Envoy proxy sidecar, which you can see depicted, injected into each one of those application pods. Now, let's consider some of the operational issues that could arise here. Let's say that you're applying an update to your service, and perhaps you need to upgrade to a new version of the Envoy proxy itself. Let's say there was a CVE that came out that required you to upgrade your Envoy proxy. And so when you recycle these services, you might encounter a race condition where a service container spins up more quickly than its associated sidecar, something like that. And so service requests might fail because the sidecar that's responsible for intercepting and routing those requests is not yet available. So you potentially suffer a number of failures. Perhaps the service can't even initialize itself properly. There's a mirror set of problems that happens on the other side of that, when the service is attempting to shut down as well. And of course you can design around these sorts of issues, but our goal with service mesh is to simplify life for our application services, not to make it more complicated. 
And so we'd like to simplify some of these operational complexity issues. All right, another challenge with a traditional service mesh architecture arises from the fact that we want the mesh to be transparent to its resident applications, but the reality doesn't always match that objective. For example, our diagram now depicts a case where we have a Kubernetes pod that contains a workload called Job 1. There it is. And that job needs to run for some period of time, complete its mission, and then shut down. The operational problem we encounter is that when jobs like this shut down, sometimes the injected proxy sidecar is still running. And that can prevent the environment from cleaning up its pod in a timely fashion and recycling those resources for use back in the cluster. So we'd like to clean that up as well. There's also the very popular topic of latency in the service mesh community. In a traditional architecture, there are Envoy proxies injected into each application workload. And these are full Envoy proxies, where the full Layer 7 processing stack must be traversed with each and every request that goes into or comes out of the service. So there are latencies that accumulate as you traverse the proxy network from an application to its Envoy sidecar to another application's sidecar to that application, and so on and so forth. And these can add up to a few milliseconds of processing time. And when you multiply that by a service network operating at scale, it can add up to a noticeable operational impact. So we'd like to make some improvements in that area as well. Finally, let's consider the cost component of a traditional service mesh architecture. All of these Envoy proxies that we're provisioning to run alongside our application workloads consume CPU and memory resources, and that in turn costs money. So one of the phenomena that we have observed at Solo is that enterprises tend to over-provision these sidecars. And the reason why is clear, right? 
If you under-resource your sidecars, then that can have a very negative impact on the application workloads themselves. And so we frequently see people instead provisioning excess capacity, and that can have a direct impact on the cost bottom line. So we'd like to make some changes to shave some cost from our service meshes as well. So these are some of the challenges that we see as enterprises adopt and operate service meshes. And while the model has worked pretty well for a few years, we'd like to learn from user experience and address the challenges that we've outlined in this sequence here. One of the ways we're doing that is a collaboration between Google and Solo. We've been working on these problems for well over a year now. We actually started working on them separately, discovered that we were working on the same problem, then joined forces and collaborated to produce, back in September, a joint contribution to the community of a new sidecarless data plane option called Istio Ambient Mesh. It's been available since Istio 1.15 as an experimental profile, and as of February it achieved a really significant milestone in that it's now been merged into the main branch of Istio. So while the community still isn't recommending it for production use just yet, we do expect that to change over the course of 2023. There's undoubtedly been a wave of interest in the community, and Solo in particular is moving really aggressively to deliver education around Ambient Mesh (we'll talk about some options for that later), as well as to integrate it and make it enterprise-ready, both in the community and in our Gloo Platform suite of products. So we anticipate a number of benefits that are going to flow to users in the service mesh community on the basis of Ambient Mesh. Those include things like cost-of-ownership reductions, simplifying mesh operations, and improving mesh performance. 
And we'll be exploring all of these in the next few minutes of discussion and demonstration. So one characteristic that you notice right away is that it's much easier to get started with Istio in Ambient Mode. We'll actually show you this in the demo. When you're working in traditional mode, every workload participating in the mesh must be injected with a sidecar proxy, and that can be a pretty heavyweight process at scale. By contrast, the Ambient onboarding process involves simply labeling namespaces to indicate that Ambient is the desired operational mode. In other words, you indicate that all workloads within that namespace will be participating in the mesh using Ambient Mode. And then under the covers, the relevant infrastructure is going to spin up without any changes to the application service pods. Conversely, if you remove workloads from the mesh, it's a similarly simple process where the Istio CNI simply changes some routing rules and the Ambient infrastructure is spun down, but the application pods themselves are completely untouched. And we'll be showing you a good bit of that in the demonstration coming up. Another point that I want to make here really quickly is that sidecar mode and Ambient Mode are fully interoperable from day one. We want to honor the fact that there are a lot of workloads out there, a lot of large security organizations who have approved the use of sidecars, and they're happily using them. And so that's going to be able to continue: you can mix and match Ambient Mode along with the traditional sidecar mode, even within the same mesh infrastructure. Okay, so let's turn our attention to how Ambient Mode works under the covers. There are two different sets of components that come into play here. First is a secure transport layer that operates on a per-node basis. It handles just Layer 4 issues. Its primary purpose is to manage securing traffic between workloads. 
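As a sketch of that onboarding step: the label involved is istio.io/dataplane-mode=ambient, applied to the namespace. The namespace name here is illustrative:

```yaml
# Opting a whole namespace into ambient mode; no pod restarts required.
apiVersion: v1
kind: Namespace
metadata:
  name: ecommerce                      # illustrative namespace name
  labels:
    istio.io/dataplane-mode: ambient   # all workloads here join the mesh via ambient
```

Removing the label reverses the process: the CNI rewrites its routing rules and the application pods themselves are never touched.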
When you need to apply policies that involve concerns like routing based on HTTP headers, maybe retries or fault injection, that is, any policy that requires access to the higher Layer 7 components of the stack, the enforcement of those policies will be delegated to a separate set of Envoy proxies called Waypoints. You can see them depicted, kind of in yellow there, toward the bottom. So there are two layers to the new data plane architecture. The first is called a Z-Tunnel. It operates on a per-node basis and is responsible for Layer 4 and mTLS sorts of security concerns. The second is the Waypoint proxy, which operates on a per-service-account, or per-identity, basis and is responsible for the higher-level Layer 7 capabilities typically associated with HTTP. But the Istio control plane architecture based on IstioD, which you see depicted at the bottom of the diagram, remains unchanged. It's going to respond to basically the same APIs that it did before, but now IstioD will magically configure the Z-Tunnels and the Waypoint proxies in addition to any traditional sidecars that are active within the mesh. Okay. So those are the two layers of the Ambient Mesh architecture. Now let's assume that we just care about the security capabilities of Istio. That's a common adoption pattern that we see in enterprises. New users frequently don't prioritize adopting the entire Istio feature set. They instead prefer taking things one step at a time and focus initially on just using mTLS to secure the communication channels between services within the mesh. For example, public sector clients come to us with a mandate like the executive order issued by the Biden administration in the U.S. It requires government agencies to adopt zero-trust architectures and secure all service-to-service communication with mTLS. And satisfying that mandate is job one. 
So in cases like that, we can fulfill that requirement using a simplified Ambient Mesh architecture, using just the components depicted in this diagram with the per-node Z-Tunnel component. In this case, let's say app A wants to talk to app B. There are no sidecars in play here. App A's request is redirected to its node-local Z-Tunnel, which uses a secure connection to reach the Z-Tunnel on the node where app B lives. The second Z-Tunnel forwards the request to app B, the request is processed, and the response is sent back along the same path. So if security is your primary use case, you can get by with adopting just the components of Istio that you need, without requiring the addition of sidecars or even the new Layer 7 Waypoint proxies. Okay. Now, if Layer 7 policies are needed, and oftentimes Istio users do need them, then we introduce this Waypoint proxy component. These proxies are established on a per-namespace, per-service-account, per-identity basis, and they can be deployed wherever you want, just using standard Kubernetes deployment techniques. You can manage them, scale them, distribute them, and manage replication of them just like you would any Kubernetes deployment. But from a policy standpoint, nothing really changes. You use Istio APIs to declare the Layer 7 policies that you want applied, and the IstioD control plane ensures that the various Waypoints are configured to enforce those policies. Okay. I'm going to skip through some of this next part pretty quickly because I want to get to the demonstration, and we were a little bit late getting started, so we're a little crunched on time. But in essence, the points I want to make here: we see three primary benefits that people are taking away from Istio Ambient Mesh. Obviously cost reduction is something that always matters, because you're replacing a lot of Envoy proxies with a much smaller Z-Tunnel footprint on a per-node basis. We've also done some analysis. 
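Waypoints are declared using the Kubernetes Gateway API. The exact annotations and fields have shifted across Istio releases, so treat this as a rough sketch from around the time ambient was merged, with illustrative names throughout:

```yaml
# Rough sketch of a per-service-account waypoint proxy declaration.
# Annotation and listener details vary by Istio release.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: recommendation-waypoint
  annotations:
    istio.io/for-service-account: recommendation  # bind this waypoint to one identity
spec:
  gatewayClassName: istio-waypoint   # tells IstioD to program this as a waypoint
  listeners:
  - name: mesh
    port: 15008
    protocol: HBONE                  # Istio's secure overlay tunneling protocol
```

Because this is just a Gateway resource backed by a standard deployment, the usual Kubernetes scaling and replication machinery applies to it.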
You can see a link to a blog there if you'd like to drill down on that, to understand the significant reductions in memory consumption and overall resource usage, and the cost savings associated with that, which we think is a pretty exciting benefit. On simplifying operations: we think this takes a big step toward making the service mesh transparent to applications, which has always been a design goal of Istio, but this really makes it more of a reality, because now you are no longer required to have an Envoy proxy injected into each service pod within your network. You only spin up these much lighter-weight per-node Z-Tunnels, and then, if you have Layer 7 policies that are applicable for a particular service, you can spin up a Waypoint as well. So this has the effect of simplifying operations quite a bit. Imagine the scenario where you need to respond to an Envoy CVE and upgrade Envoy throughout your network; that becomes much easier. You don't necessarily have to schedule downtime across the whole service mesh in order to enable that. And from our customers' standpoint, that has probably been one of the biggest changes to come along due to Ambient Mesh. Performance is also something that we see as being a big benefit here. If you check out the Solo blog, you can see information on performance analysis as well. But what I want to do here is cut over to the demo really quickly. This is going to be a short demo, but I want to give you a sense of what is in play here. So what I have here is an Instruqt environment. I've spun up a four-node Kubernetes cluster, and we've done a base Istio installation using the ambient profile. So if you take a look at the services that are available, you can see what's spun up here. We have Istio CNI components that are responsible for redirecting the network traffic within each node on the cluster. 
You can also see there's a Z-Tunnel component that is deployed on each node as well. You do not see any sidecars here at this point, and you also don't see any Waypoints here at this point, because we have not declared any Layer 7 policies that require them. So basically we have a little service network here. Let me show you this diagram. It's very simple, sort of like an e-commerce application. We have clients that talk to a web API. The web API talks to a recommendation engine, which talks to a purchase history service, and we need to make sure that that path is clear and that we can invoke all the services in that path. We're going to use Istio to do that. So basically we have this setup deployed, and in fact if we go in here and curl this, you can see what we're getting back. These are not real services; they are deployed using the Fake Service framework. So all you get is basically just a recognition that, hey, this is the web API module, it calls the recommendation service, which in turn calls the purchase history service. So it's just using fake services, but with the real service invocation chain being involved there. Okay, so let's get out of this setup exercise, and we'll go to a second exercise where we're actually going to add services into ambient mode and show you some basic Layer 4 policies in play. So let's start this up. What we're going to do to add these services to ambient is very simple: we're going to label the namespace with this label here. And when we do that, what's going to happen is that's going to cause the CNI layer to kick in, and it's going to start redirecting traffic from the target services to the Z-Tunnel that's located on their nodes. You can actually see that. If you take a look at the logs here, you can actually see where there are pods being added into ambient mode. 
You can see some of the redirection rules being activated here to do the traffic redirection. So once that is in place, we're going to have a configuration that looks something like this, where we have a client, that client is talking to its local Z-Tunnel, and it establishes a secure connection to the Z-Tunnel on the remote instance where the web API service is located. The traffic flows through that secure tunnel, and so it goes. So that's the simplified architecture that we'll have in play for our data path. If we test this out here, you can see we're actually putting about 100 requests into this service network. And if you take a look at the logs, these are actually the Z-Tunnel logs themselves, you can see there's a collection of both inbound and outbound traffic happening here. I guess we're just looking at outbound here. But you can also see the SPIFFE IDs of the workloads that are involved. So basically all of the traffic requests that are going in and out are actually going through the Z-Tunnel, just like we would expect. Now, let's take a look at some of the observability. Obviously, with the Z-Tunnel, we're able to capture Layer 4 metrics, and you can see some of those right here by scraping the metrics endpoint. You can see some of the Layer 4 statistics that are emitted there. We can also specify Layer 4 authorization policies and have those implemented strictly at the Z-Tunnel layer. So here is an authorization policy. This is going to basically block everything, allow nothing to get through. And so we'll apply that authorization policy. Now, if we try to do the same thing that we did before, you can see that it's going to fail. And then we're going to establish a new set of authorization policies that activate just the path that we want through this service network. So you can see, right, here's the web API. 
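The deny-everything policy shown in the demo is likely something close to this sketch; an Istio AuthorizationPolicy with an empty spec matches no requests, which makes it a namespace-wide default deny (the namespace name is illustrative):

```yaml
# Namespace-wide default deny: an empty spec matches no requests,
# so everything in the namespace is rejected at Layer 4.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: ecommerce   # illustrative namespace name
spec: {}
```

Starting from deny-all and then adding narrow allow rules is the standard zero-trust pattern this part of the demo is walking through.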
It's only going to allow calls from, you know, this sleep pod and from the Istio ingress gateway. Similarly, the recommendation pod is only going to allow calls from the web API, and so forth. So that's not what I wanted to do. All right, so let's go ahead and add this authorization policy. No, that's the one we've already added. All right, the demo gods are not smiling. There we go. Okay. So if we add that and we call this from one of the pods that we specifically cleared for access, then you can see this works just as expected. However, if we call it from another location that was not specifically enabled, then you're going to see that it fails. Now, if you take a look at the pods that are here, you can see these are the actual pods that we've been playing with in this ambient demo. You can see all of those are, you know, container one of one, right? If this were a traditional Istio installation, there would be two containers in that pod: the actual application workload and an Envoy proxy. But we're not doing that. Everything is being mediated through the local Z-Tunnel. So time permitting, we would show you how Layer 7 policy works as well. We're not going to have time to do that right now, but what I do want to do is show you some resources, if you'd like to learn more about this. Actually, a couple of things I want to show you here. Reaction to Ambient Mesh has been very enthusiastic. This is from Matt Klein, the creator of Envoy, talking about how this is really the right path forward for service meshes and hoping that it's going to continue. Also, just within the past week, we see that Microsoft is actually throwing its support behind Istio, in large part due to the innovations from Ambient Mesh, which is a good thing. And then finally, if you'd like to learn more, there are a couple of resources I want to point you to. 
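The allow policies described here are built on source principals, the SPIFFE identities that mTLS provides. As a sketch with illustrative namespace, label, and service account names, a policy admitting only those two callers to the web API could look like this:

```yaml
# Admit only the sleep pod and the ingress gateway to the web API workload.
# Namespace, labels, and service account names are illustrative.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-web-api-callers
  namespace: ecommerce
spec:
  selector:
    matchLabels:
      app: web-api                # applies only to the web API workload
  action: ALLOW
  rules:
  - from:
    - source:
        principals:              # SPIFFE-style identities, verified via mTLS
        - cluster.local/ns/ecommerce/sa/sleep
        - cluster.local/ns/istio-system/sa/istio-ingressgateway
```

Because identity comes from mTLS rather than network location, rules like these keep holding even as pods move around the cluster.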
There's a great little e-book that you can download, written by a couple of my colleagues at Solo. There's a link to that there. If you have a fairly long flight to get back home like I do, I promise you you can finish that book on that flight. It will give you a really nice overview and kind of help you understand this. If you're like me and you prefer a hands-on approach to learning about these things, then I want to point you to this ambient workshop. You go to academy.solo.io. Everything here is completely free. It's going to spin you up on just the Ambient Mesh pieces and actually take you through a much more extensive version of the exercise I was going through just now. So you can get a full overview there, and I encourage you to do that as well. All right, back to my final thing. Did parsimony win the day in the... Okay, we're out of time. Let me tell you, it was great. I still remember it really well. The little slot car versus the big behemoth design. They carry it up to the line, they crank up all the potentiometers to make it go as fast as it can, and as soon as it comes to the first curve, it goes spinning off into space, and the little slot car just swishes its way to victory. And by this point, the whole crowd had become fans of the underdog, the underdog little slot car, so the whole place just erupts in this giant cheer. So simplicity, parsimony, wins the day. David beats Goliath, and all is well with the world. And I want to thank you. Thank you for your time. If you have any questions, we'll be happy to chat for a while outside. Thank you very much for your time.