Thanks for attending my session at ServiceMeshCon.eu 2021. In this talk we're going to look at adopting a service mesh in an enterprise organization, and at some of the challenges and things to look out for. This is intended to be a 101-, maybe 201-level talk, not a deep dive or anything advanced. For that, you can look at some of the other talks I've given in the past, or the blogs and books I've written. My name is Christian Posta. I'm Field CTO here at solo.io. We work with organizations around the world, large and small, adopting Istio-based technology and deploying it into production at high scale to build out their application networking architecture. I've written books on this and been involved in the Istio community since the very beginning, and although this won't be an Istio talk per se, Istio will pop up, since that's what we do here at solo.io. The typical customer challenges we work on are around modernization: moving to an application architecture that better supports moving faster and delivering features more quickly, using public and private cloud to do that. Service mesh is a piece of that puzzle. A service mesh solves the problems of how applications communicate with each other. In a typical organization we see not just modernization efforts around building new services with new tools and new architectures, but also bringing along older, more legacy monolithic systems, trying to split them up, rewrite them, and integrate them with the rest of the organization.
So how do we connect services in a heterogeneous environment like this and solve the problems of service discovery, load balancing, timeouts, retries, circuit breaking (the resilience aspects of application networking), security, and observability, and do this in a way that is cloud friendly? By that I mean driven by APIs, so you can programmatically control this at runtime, knowing that the underlying infrastructure is dynamic and ephemeral. Service mesh plays a role in solving this type of problem. The problem we're going to look at is deploying microservices onto dynamic infrastructure and getting those things to communicate with each other. That by itself is a complex problem. Using a service mesh simplifies some areas of it, but it's not a jump-in-and-adopt-everything-at-once proposition. The way we work with organizations is, first of all, to determine whether they need this type of technology at all, and then to adopt it slowly, using a crawl, walk, run methodology. Getting this sort of infrastructure into place can really enable an organization to achieve its technical goals, which then leads to better business outcomes, and that's where the "fly" part starts to come in. So the first question to ask yourself is: do you need a service mesh? Only you can answer that, really. Are you building a microservices-style architecture? Are you using multiple frameworks and multiple languages? Microservices themselves are a complicated field, so when you hear that service mesh is complicated, or that Kubernetes is complicated, remember that we're dealing with a foundationally complicated topic. If you're moving to containers and leveraging this kind of dynamic infrastructure, that may necessitate a solution like a service mesh.
If you're doing RPC-style interactions between services built with different frameworks, and making updates to how policies are enforced or written is painful, you may need some sort of automation for this; a service mesh can help. And if you're getting to a large set of services, with individual teams owning their own services, and you need a platform approach to solving these problems consistently, then you may look at using a service mesh for that. Things to keep an eye on if you're not there yet: the number of services you're trying to support, especially if they're heterogeneous (different languages, different frameworks), and whether you can consistently understand what's happening on the network by capturing golden signals: which requests are failing, how long they're taking, whether circuit breakers are opening, how many retries are happening between services. You want to detect what's happening on the network in a consistent way, not leave it up to each application developer, who maybe exposed the right telemetry signals and maybe didn't. And at the end of the day, just like with any other technology, you want to pick a handful of high-value use cases, start down the path, and iterate. Things like Kubernetes and Istio provide a lot of functionality and touch a lot of different parts of the organization, so the focus should be: start small, pick a particular use case, and iterate. Some typical starting use cases: security and compliance, where data in flight needs to be encrypted and services communicating with each other need to validate that they are indeed the services they claim to be; and resilience, building things to be a bit more robust with timeouts, retries, circuit breaking, and so on, in a consistent fashion. These are some of the top use cases. So let's take a look at what this journey entails.
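To give these resilience use cases a concrete flavor, here's a minimal sketch of how a mesh like Istio expresses timeouts, retries, and circuit breaking declaratively, outside the application code. The service name and thresholds are hypothetical, chosen only for illustration:

```yaml
# Hypothetical example: declarative resilience policy for a service
# named "recommendation", applied by the mesh rather than app code.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation
spec:
  hosts:
  - recommendation
  http:
  - route:
    - destination:
        host: recommendation
    timeout: 3s                  # overall per-request timeout
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: 5xx,connect-failure
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: recommendation
spec:
  host: recommendation
  trafficPolicy:
    outlierDetection:            # "circuit breaking": eject failing hosts
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
```

The point is that every service gets the same behavior from the proxy, regardless of language or framework, instead of each team re-implementing retries in their own library.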
Let's start with something that, I don't want to say is obvious, but becomes very clear as soon as you start digging into these types of solutions: there will be a learning curve. There will be a learning curve in how your particular organization adopts this technology, just like there probably was with Kubernetes. These tools blend infrastructure and developer concerns, so they require different parts of the organization to come together and figure things out together. Ideally you will have a foundational platform in place: automation like your CI/CD pipelines, or any custom automation you've needed to build to make your platform work. You would ideally have a telemetry collection system, some place where you store time-series metrics and can do dashboarding. Once you start to kick the tires on a particular service mesh, or you choose one, you will want to understand the underlying data plane technology: the proxies that are actually on the request path. That's a very important step in adopting a service mesh. When people start exploring the mesh, the data plane may seem like a black box. But with Istio, or any Envoy-based service mesh, there's a lot to glean from Envoy. Envoy is a white box: it exposes a lot of information you can leverage to understand exactly what is happening on the network. When we work with our customers and users in the community, we typically see people do this on their own, and we recommend it anyway: start at the edge first. You're eventually going to build a mesh where services communicate with each other through data plane proxies, but start with ingress. Start with one proxy, not 500. First of all, it's a well-understood pattern, and you can use it as a learning opportunity.
If you put the proxy at the ingress, or API gateway, layer, use that as an opportunity to understand the data plane thoroughly: how to debug it, how to troubleshoot it, how to integrate that piece and the control plane with your automation and your observability systems. That should be the way you tiptoe into adopting a service mesh: start at the edge, with the familiar ingress pattern. Using Istio, for example, you don't need sidecar proxies just to get ingress traffic routing at the edge; you can start at the edge and route out to any services. As you refine this, and maybe expand out to different clusters or different infrastructure footprints, you may find that this ingress gateway pattern builds up a layer or two. You may isolate different boundaries (different applications, different clusters, different organizational units) following the same ingress pattern, and you start to see a mesh of edge gateways form. From here, you can continue to push those proxies down into your infrastructure and start to get the benefits of service-to-service communication. So let me pause here for a second and take a look at a quick demo. Hopefully everything is set up the way I expect. What we're going to do is look at an existing set of services: we have a Kubernetes cluster with a handful of services deployed. In this case, the web-api service calls the recommendation service, which then calls the purchase-history service. You can see these pods running in our Kubernetes cluster. We don't have any service mesh installed yet, so we're going to install one: Istio, in this case.
We'll give it a second to go through the installation and bootstrapping of the control plane. Let's do a kubectl get pods in the istio-system namespace and see whether it's coming up. For this demo we also need our cloud provider to cooperate and give us an external IP so we can actually make calls. So what have we installed? If we look at the top pane, we've installed the control plane for our service mesh, and we've also installed the ingress gateway. This is an example of starting small: we have our ingress gateway, we have our control plane, and that's all we really need to get started. We finally have our external IP, so let's carry on. The first thing we're going to do is expose one of our services, the web-api service, on the ingress gateway so clients from outside the cluster can call in. Let's apply the resources the way you do it in Istio, give it a second, and then when we make a call, using curl in this case, we should get routed to our service. And we do: web-api calls recommendation (these are some sample services), which calls purchase-history. All right, that's all good. We can also do more at the edge. Istio's Envoy, for example, has out-of-the-box capabilities like JWT validation and TLS termination, and we can apply those to our edge gateway and start to get value from it. If you're trying to go full-blown API gateway, where you need transformation, complex usage policies around rate limiting, and things like OIDC, LDAP, and more advanced security features, look at something like Gloo Edge, which is also based on Envoy and delivers more API gateway functionality.
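The demo doesn't show the manifests on screen, but exposing a service at the edge in Istio generally looks something like the following sketch. The hostnames, ports, and the JWT issuer are placeholders, not the exact resources used in the demo:

```yaml
# Sketch: expose web-api through Istio's default ingress gateway.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: web-api-gateway
spec:
  selector:
    istio: ingressgateway          # Istio's stock ingress gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-api
spec:
  hosts:
  - "*"
  gateways:
  - web-api-gateway
  http:
  - route:
    - destination:
        host: web-api
        port:
          number: 8080             # placeholder service port
---
# Optional: JWT validation at the edge (issuer/jwksUri are placeholders).
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-at-edge
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
  - issuer: "https://auth.example.com"
    jwksUri: "https://auth.example.com/.well-known/jwks.json"
```

Note that none of this requires sidecars in the application namespaces: the Gateway and VirtualService only program the edge proxy.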
All right, so that's the crawl part of our journey; we're just putting our toes into the service mesh world. Once we've gotten a little past that, we get to the walk part, where you start allowing different teams to use the functionality of the service mesh: client-side load balancing, service discovery, timeouts, retries, circuit breaking, automated security, policy enforcement, and so on. You need to identify which teams and which roles make up the groups that will be consuming this technology, and how you're going to expose it to them. Are you going to give them the Istio APIs, or your respective service mesh's APIs, directly? In general, those are fairly complicated. Are you going to build an API of your own, or tie into your existing automation? There's a handful of things to explore there. When you roll out the sidecar proxies to your applications, you're hoping to do it in a transparent way. You should test that, because we've run into many cases where it's not. For example: the application comes up first and tries to make a connection to the outside world, but the sidecar proxy isn't ready yet, so the connection fails. The application can't handle that (it wasn't built for it), so the pod recycles, the same thing happens again, and you get crash looping. Or the opposite: a service is being spun down for some reason, the proxy comes down first, but there are still live connections open, and the application can't handle that. So explore exactly what "transparent" means for you and your applications. When we're migrating or modernizing, these are typically imperfect cloud applications, and figuring out how the service mesh will behave with those applications is part of the walk phase of this journey.
Like I said, iteratively take advantage of the features: start to roll out the service proxies, enable things like telemetry collection (so your Prometheus or your Datadog scrapes the telemetry from these proxies), and start to build security policies. All the while, build up your skills around debugging the configuration issues you might run into, or debugging the network once the proxy is in place and things start to go wrong: things are slow, things are failing, how do you actually debug that? And of course, day-2 operation of a service mesh includes being able to upgrade the mesh with zero downtime. So let's take a very quick look at how we might roll a service mesh out to our existing services. We have our existing services, with no service mesh and no sidecar proxies deployed. If we call from the sleep service to the web-api service, we see that the communication works. Now, if we have automatic injection available for our service mesh, we can start to slowly and iteratively (notice the common theme) bring workloads into the mesh. Any time we make a change to our application, we want to treat it carefully, as a canary or slow rollout, and injecting the sidecar is a change to the application, even if we haven't touched the source code. So now we have two replicas of the web-api service, but only one of them has the sidecar. We run some smoke tests, make sure everything continues to function and looks fine. And it does. Then we continue rolling out across the various services we have; in this case, we'll speed it up.
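One common way to get this kind of selective, workload-by-workload injection (an assumption on my part; the demo doesn't show its manifests) is to enable Istio's automatic injection at the namespace level and then opt individual workloads in with a pod annotation, so the sidecar can be canaried one deployment at a time. Names and image are placeholders:

```yaml
# Namespace opted in to automatic sidecar injection.
apiVersion: v1
kind: Namespace
metadata:
  name: istioinaction
  labels:
    istio-injection: enabled
---
# Workload-level opt-in: flip this annotation per deployment
# during the rollout so only canaried pods get the sidecar.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
  namespace: istioinaction
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: web-api
        image: example/web-api:latest   # placeholder image
```

Running a second copy of the deployment with `sidecar.istio.io/inject: "false"` alongside this one is one way to reproduce the "one replica with a sidecar, one without" canary shown in the demo.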
We're fast-forwarding what would otherwise be a slow canary rollout of every service, where we slowly introduce the service proxy. We follow the same model for enabling mutual TLS. This is something we run into with a lot of organizations: they have services communicating with each other and they want to enable mutual TLS. For the services in the mesh, we can do that with a permissive mutual TLS policy: if both sides are able to do mutual TLS, then we do mutual TLS; otherwise, for legacy services that speak plain text, we still accept plain text. Then there are things you can do to slowly phase out the plain-text services: you can monitor and check whether any services are still communicating over plain text, and at some point say, all right, no more plain text, everything is going to be mutual TLS, everything will be encrypted in the system. But again, that's a slow, phased rollout. Let's come back here. So that's the walk: we've laid some of the foundational pieces and started to roll out the proxies. Now we want to get this out into the organization, because it provides a lot of value. But we can't leave behind our existing policies, and we can't leave behind existing infrastructure like VMs. We can't forget that out of the box, a service proxy might not do exactly everything a particular organization needs. And there are other concerns: we may have regulatory reasons to build out multiple clusters or isolate certain groups from others, and doing all of this in a secure fashion is part of expanding and growing.
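The permissive-then-strict mTLS migration described above maps to Istio's PeerAuthentication resource; here's a rough sketch. Applying it in the root namespace makes it mesh-wide:

```yaml
# Phase 1: accept both mTLS and plain text while legacy
# (non-mesh) callers still exist.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system    # root namespace => mesh-wide policy
spec:
  mtls:
    mode: PERMISSIVE
# Phase 2: once monitoring shows no remaining plain-text traffic,
# change the same resource to:
#   mtls:
#     mode: STRICT
# so everything in the mesh must be mutually authenticated
# and encrypted.
```

The monitoring step matters: with permissive mode, the proxies' telemetry can tell you which connections are still plain text, which is how you know when it's safe to flip to strict.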
Now we're bringing the rest of the organization onto this platform: deploying sidecar proxies onto VMs and treating those as first-class citizens in the mesh, and forcing traffic that leaves the mesh through a particular egress point where you can apply policies, security controls, and so on. And, as I mentioned, extensibility in the proxy is key. For Envoy-based service meshes like this, you can write customizations to the proxy using WebAssembly. That's what we're going to take a quick look at here. What we're going to do is augment, or enhance, the data plane in a particular request flow with WebAssembly. We've written a WebAssembly module that enhances the headers in the response from a particular service. If I call a service in our cluster, from sleep to, let's say, the review service, the review service gives us a JSON response, and we can see what the response headers look like. Now let's say some team expects a hello-world header to be returned in the response. How would we do that? With WebAssembly, we can write the extension in whatever language we want, package it up into a WebAssembly module, deploy it to an OCI registry, and share it with everyone. Then you need a way to pull it down and install it into your service mesh. Here's one example where we've already tagged an existing WebAssembly extension, and now we're going to install it onto a particular workload: we select the cluster and the workload where the data plane is running and install our WebAssembly module there. Let's cross our fingers and hope I set this part of the demo up correctly; I don't remember. All right, we did that.
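The demo uses solo.io tooling to push the module, but for reference, upstream Istio exposes a WasmPlugin resource that does roughly the same thing: pull a Wasm module from an OCI registry and attach it to selected workloads. This is a sketch, not the demo's actual configuration; the registry URL, plugin config, and labels are placeholders:

```yaml
# Sketch: attach a header-enriching Wasm module to one workload.
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: add-header
  namespace: istioinaction
spec:
  selector:
    matchLabels:
      app: web-api             # only this workload's proxies load it
  url: oci://registry.example.com/add-header:v0.1
  phase: AUTHN                 # where in Envoy's filter chain it runs
  pluginConfig:
    header: hello-world        # hypothetical config read by the module
```

Because the module ships as an OCI artifact, it can be versioned, scanned, and promoted through registries just like a container image.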
Now let's check the status of this descriptor. If we look at the status, we can see that it has been deployed. So now if we call that same service again and show the headers, we should see that the response has indeed been enriched with this new capability. Using WebAssembly to modify the request path, modify the messages in a request, or implement backwards-compatible security protocols that your organization might have is extremely powerful, and it allows you to continue to scale out and roll the mesh out to various parts of your organization. The last part: now we're running. You have this mesh: you're bringing in the VMs, you've got your security policies in place, traffic is leaving the mesh through egress gateways, everything looks good. Now what does that give you? It gives you a tremendous amount of power in your application network. How services communicate with each other is no longer tied to some centralized API management system; everything is decentralized. It's not tied to big, expensive hardware load balancers, hard-to-change DNS, or fixed IP addresses. When services communicate with each other, they can use a global name, across clusters, across VMs, across clouds, and the mesh is smart enough to say: when service A talks to service B, prefer service B locally; or, if service B is a little overloaded, fail over to the next available zone, or fail over to a different region, without the client having to know anything about it. By treating this set of different service meshes as a federated unit, you get extremely powerful networking and routing, and you can start to build things on top of that. And of course, you want to expose this and allow all of your teams to leverage this capability.
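The "prefer local, fail over to the next zone or region" behavior described above is something Istio can express with locality-aware load balancing in a DestinationRule; here's a sketch with placeholder service and region names. Note that outlier detection has to be configured for locality failover to take effect, since the mesh needs a health signal to decide when to fail over:

```yaml
# Sketch: locality-aware failover for calls to a hypothetical
# cross-cluster service "service-b".
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: service-b
spec:
  host: service-b
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:              # if us-east endpoints are unhealthy,
        - from: us-east        # send traffic to us-west instead
          to: us-west
    outlierDetection:          # required: supplies the health signal
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
```

The client application just calls `service-b`; where the traffic actually lands is a mesh-level decision the operator can change without touching code.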
Doing that in a way that is isolated and secure for the various tenants in your organization is extremely powerful. In the use case I'm describing, you may have multiple clusters and multiple different types of infrastructure, but your services stay simple: you focus on the business logic. When you make a call on the network, service discovery is happening, client-side load balancing is happening, global service discovery and failover are happening, and you also get the timeouts, retries, and circuit breaking: all the stuff that augments and enhances an application and makes it a better-behaved citizen on the network. Here's an example of what a deployment might look like to enable that: you have multiple clusters, maybe VMs, and you're getting consistent configuration, consistent security policies, consistent service discovery, and global load balancing provided by this federation. The journey I just described is exactly what we work through with our customers and open source users here at solo.io, and it comes from experience, and some pain as well. This is something people are still learning for themselves, and we're seeing more and more successful deployments of service mesh at very large scales, across hundreds of clusters and control planes. So that's all I have. Thanks for joining my session. Come chat with us: the solo.io Slack is a great place, with a lot of resources about Envoy, Istio, and some of the open source projects we're building, along with the tooling we're building on top of these things to simplify the usage and operation of this type of technology and help you be successful. I appreciate you watching my talk. There are a lot of other good talks, so go see them, and have a good day.