Thanks for joining this session on planning the zero-downtime life cycle of your service mesh. My name is Christian Posta, and I am the Global Field CTO here at Solo.io. I've spent a lot of my career helping organizations build scalable distributed systems, and since 2017 I've been working specifically on service mesh technology, helping organizations navigate the complexities of choosing a service mesh, deploying it, operating it on day two, and everything in between. I'm also writing the Istio in Action book with Manning, and that should be out pretty soon; I think we're getting some of the last updates to the publisher right now. Like I said, I work at Solo, and I came over to Solo almost three years ago now to work on this application networking infrastructure based on Envoy proxy. You'll see that a lot of service meshes are based on Envoy, but there's a bigger story here around connecting things, especially in an enterprise environment: across different lines of business, across different zones and infrastructure, from VMs to containers, on premises and in the public cloud, and everything in between. That's what we focus on at Solo. We have a lot of expertise in this area, in Envoy and in Istio, and we've built our products to scale and to work very nicely in enterprises. You can see here just some of the folks we have working at Solo: Neeraj Poddar, who just joined us a few weeks ago, and Lin Sun, both of them on the Istio Technical Oversight Committee (TOC), and many others who have been involved in the community, working upstream as well as with our customers, including some of the largest deployments of service mesh in the world. I'd also like to point out real quick that we were just featured in the service mesh radar report, where closer to the bull's-eye, closer to the middle, is best, and you can see Solo.io there.
We've been working very hard, and that shows in the analysts' observations of us as well. So let's get right into it. We've been doing these conferences for a while now, and the folks who come to ServiceMeshCon and these types of talks are probably familiar with service mesh technology. Loosely defined, it's a framework that allows you to offload application networking concerns from the application to a little agent or proxy that runs co-located with the application, and that uses some kind of control plane to manage the configuration and drive the policies and the behavior of the network as observed and enforced by those sidecar proxies that live with the applications. The operators or the end users configure the control plane, and the control plane ends up configuring the data plane. Now, a service mesh is a very critical piece of application networking infrastructure. When it's in place, these proxies live on the data path between the application and the network; requests flow through these layer 7 proxies. Since that's the case, configuration changes, lifecycle events, and upgrades are incredibly important events that need to be planned for and shouldn't be taken lightly, because if you start to take down the request path between your various services, you start to see outages, and that is not something we want.
All right, so before we get into planning how to manage the lifecycle of a service mesh, which is a very critical component in your architecture, we should start by understanding some of the areas where the service mesh can fail and things to watch out for. The first things we want to talk about are the request path and the configuration path. When applications are communicating with each other, the traffic traverses these data plane proxies. The proxies are doing things like routing control, transport security, maybe authorization (whether traffic should be allowed or not), collecting telemetry, and so on. But as I mentioned earlier, these proxies are driven by configuration from the control plane, and so the first failure area we'll take a look at is what happens when the proxies can no longer get configuration updates from the control plane. In an upgrade, this is a very real scenario, so we've got to explore this failure domain. Now, ideally this doesn't affect the request path; you just end up with stale configuration, and you don't see new configuration or updates to the system that have happened since the communication went down. Modern proxies like Envoy can deal with this and are expected to deal with this. In fact, Envoy's configuration model was built specifically to be eventually consistent, and the proxy can survive for some period of time in this state. You might miss updates to endpoints and so on, but the proxy can do things like outlier detection to work around misbehaving endpoints until communication is reestablished.
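As a concrete sketch of the outlier detection mentioned here, an Istio DestinationRule can tell the sidecar proxies to passively eject misbehaving endpoints even while the control plane is unreachable. The service name and namespace below are hypothetical examples, not from the talk:

```yaml
# Hypothetical example: passive health checking (outlier detection)
# for a "reviews" service, so stale endpoint lists degrade gracefully.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-outlier
  namespace: app-ns            # assumed application namespace
spec:
  host: reviews.app-ns.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5  # eject an endpoint after 5 straight 5xx responses
      interval: 10s            # how often ejection analysis runs
      baseEjectionTime: 30s    # how long an endpoint stays ejected
      maxEjectionPercent: 50   # never eject more than half the endpoints
```

Because this policy is enforced locally in each Envoy sidecar, it keeps working with stale configuration, which is exactly the failure mode being discussed.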
We'll take a look at some approaches for dealing with this and how to minimize this sort of communication failure. Another area we want to look at is how traffic gets into the service mesh. The proxies can do things like mutual TLS and service-to-service authorization, and enforce those types of policies, but somehow traffic has to get into the mesh before the mesh can start to enforce any of that. Some service meshes have a well-defined way for traffic to get in through some sort of ingress gateway; Istio, for example, has this. A failure area we want to watch out for, especially when doing upgrades, is taking down the ingress points through which traffic gets into the service mesh. Another thing to watch out for is configuration domains and scopes. You want to watch out for global changes or global configurations that, upon upgrade, could potentially take down the configuration of the data plane. This would impact the request path and whether services can communicate with each other. So keep in mind: what is the scope of your configuration?
Which configurations are backward or forward compatible, and how do you work around, or at least control, the blast radius of these types of changes? Now, there are many other things beyond the few areas we're going to explore today. A service mesh is a critical piece of infrastructure that's likely highly integrated with other parts of your stack: things like a dedicated certificate authority for signing workload certificates or setting up edge ingress certificates; integrating with the rest of your observability stack, tracing, telemetry collection, and so on. There are also things that don't immediately jump to mind in a service-to-service communication runtime environment, like CI/CD, where you might have automation around what configurations live on a cluster, for example Argo or Flux, which we've seen can fight you when you're trying to do an upgrade if it's not planned correctly ahead of time, and cause some unintended behaviors; as well as integration with external or existing API gateways that you will also need to take into account. So let's take a look at a few of these failure areas and see what we can do to help. The first thing we'll look at is getting traffic into the service mesh itself. As I mentioned, some service meshes have an out-of-the-box gateway that takes you from untrusted, unknown-to-the-mesh traffic to trusted traffic inside the mesh. Istio is one of these service meshes; it has an ingress gateway capability. We want to separate out the lifecycle of how we get traffic into the service mesh from the lifecycle of the rest of the mesh, and this can be done in a couple of different ways. One: maybe you have a completely separate ingress, maybe using some different technology that's not part of the service mesh. Or you split it out into separate control planes so that the ingress can be
operated and treated differently from the rest of the service mesh. Istio is a particularly mature implementation of a service mesh, and you have a few different options here, but you can do this through the operator and have different lifecycles by setting up one configuration just for the ingress and another configuration just for the control plane, completely separated, so that they can be updated and upgraded independently. Here's an example of how you might do that using Istio's configuration. On the left-hand side we see configuration for the control plane (note the bolded areas): we're installing the control plane piece by itself, nothing else. On the right-hand side, we're installing just the ingress gateway component, nothing else. So we're able to run these in conjunction and have separate lifecycles at the operator level. This will come in handy once we start to upgrade the rest of the control plane, because we can do that independently of the gateway. We separate out the possibility of taking down traffic coming into the mesh from the upgrade of the data plane inside the mesh itself. And to minimize downtime, stale configuration, and connection loss between the data plane and the control plane, what we want to do is update them both at the same time, but in a controlled way.
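The split described here might look something like the following pair of IstioOperator resources — a minimal sketch; the resource names, namespace, and profile choices are assumptions, not the actual slide content:

```yaml
# Control plane only: the "minimal" profile installs istiod
# and no gateways, so it can be lifecycled on its own.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  profile: minimal
---
# Ingress gateway only: the "empty" profile installs nothing,
# then we enable just the ingress gateway component.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ingress-gateway
spec:
  profile: empty
  components:
    ingressGateways:
    - name: istio-ingressgateway
      namespace: istio-ingress   # assumed dedicated namespace for ingress
      enabled: true
```

With two separate resources, upgrading the control plane never touches the gateway deployment, and vice versa.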
We want to do it using tried-and-true patterns like canary releases. We want to avoid big-bang or in-place upgrades that have the potential to take down the entire system, even if they have backward compatibility built into the implementation. So for example, with your service mesh, again using Istio as the example, we can deploy a canary control plane without affecting any of the data plane traffic, whether at ingress or on the east-west side. From there, what we can do is slowly introduce the relevant components. In this case, we introduce an ingress gateway tied to the new, canary control plane. It won't take any traffic yet; we're just getting the pieces in place so that we can slowly and surgically roll over the data plane pieces. So in this case, we start to see one workload move over to the new control plane. Everything still continues to flow: traffic coming into the load balancer and through the original ingress gateways. Then we slowly roll over the traffic. Now more of those workloads have moved, and even the new ingress gateway is taking traffic and bringing it into the various workloads. The more traffic we roll over to the new control plane, the closer we get to shutting down and eliminating the traffic through the older control plane. And if at any point in this process we see an issue, we can roll back to the old control plane.
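This canary pattern maps to Istio's revision-based upgrades. A rough sketch, assuming a hypothetical revision name and application namespace (the talk doesn't specify either):

```yaml
# Install a second, canary control plane under a revision label,
# alongside the existing one. "canary-1-12" is a made-up revision name.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane-canary
spec:
  profile: minimal
  revision: canary-1-12
---
# Move one namespace's workloads to the canary control plane by
# switching its injection label, then rolling-restart the workloads.
apiVersion: v1
kind: Namespace
metadata:
  name: app-ns                  # assumed application namespace
  labels:
    istio.io/rev: canary-1-12   # replaces the old istio-injection=enabled label
```

Workloads pick up sidecars from the canary control plane only after a rolling restart, which is what lets you move one workload at a time and roll back by flipping the label.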
So this provides a lot more safety. It works around and eliminates some of those areas of potential downtime by running things concurrently and rolling traffic over to the new version slowly, methodically, in a controlled way, without a big-bang or in-place upgrade. The last thing we'll talk about is limiting your blast radius with configuration, taking configuration into account in an upgrade scenario. For example, scoping configuration to specific applications, or in Kubernetes to specific namespaces, is highly desirable. That's step one: try to avoid large-scale global configuration. In Istio, that means avoiding putting configuration into the istio-system namespace, which should be an operator- or platform-team-restricted namespace; we don't want everybody dumping their configs in there. Scope configurations down to the specific namespaces where applications live, where you might have tenancy and ownership rules about that. You can even use features in Istio specifically to scope things down. When you deploy configurations, the Istio control plane watches all of them across all the namespaces and tries to configure the mesh globally. But what you can do in Istio, for example, is use the exportTo field in the various configuration objects like VirtualService or DestinationRule; exportTo controls where those configurations get applied.
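As a sketch of both scoping mechanisms discussed in this part of the talk, here is a VirtualService restricted to its own namespace with exportTo, and an EnvoyFilter override pinned to a specific proxy version. Resource names, the namespace, and the filter body are all hypothetical:

```yaml
# exportTo: ["."] keeps this routing rule visible only inside app-ns,
# so it cannot affect sidecars in other namespaces.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
  namespace: app-ns            # assumed application namespace
spec:
  exportTo:
  - "."
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
---
# A break-glass EnvoyFilter override pinned to proxies at a given
# Istio data plane version, so a later upgrade simply stops matching it.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: add-demo-header
  namespace: app-ns
spec:
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      proxy:
        proxyVersion: ^1\.12.*   # only applies to 1.12.x proxies
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.lua
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
          inlineCode: |
            function envoy_on_request(handle)
              handle:headers():add("x-mesh-demo", "true")
            end
```

The proxyVersion match is the key piece for upgrades: proxies on the newer data plane version never receive the override, so the canary rollout is not affected by a stale low-level customization.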
So even if you deploy a configuration in one particular namespace, you can say: only export this config to this namespace, not to everything else. If you leave out exportTo, the configuration applies to all namespaces, even though it's deployed in a particular namespace. So you have much more fine-grained control over how the configuration is visible outside of a particular namespace. Now, exportTo is a list, so you can be very specific about which namespaces should observe a particular configuration, and narrowing those down and limiting them is highly desirable. Also keep in mind where configuration could conflict with an upgrade. Istio has a break-glass feature that allows you to override the capabilities of the underlying data plane, in this case Envoy proxy, with a resource called EnvoyFilter. EnvoyFilter also has an exportTo field, which you can configure to scope these overrides to a specific namespace. But an even more useful configuration here is specifying exactly which version of Istio a global override should be applied to, and making sure it does not get applied more broadly. So if we were to upgrade to a new version, the override would only apply to the older version anyway. So, these are some of the failure areas and lifecycle considerations you should keep in mind when managing and operating a service mesh. At Solo we work with some of the largest service mesh deployments in the world. If you're interested in seeing more of what we're doing, whether working with our customers on service mesh or building the products themselves, please reach out.
We are hiring across the world; as you can see, our presence is global. We're hiring people to work in the field with customers, people to write the code in the back end, sales, customer success, and everything in between. So I just want to thank you for joining this session. Definitely check out some of the YouTube channels that we have; we share a lot about what we're doing with our customers, and you can learn a lot about Istio, service mesh in general, Envoy proxy, and some of the innovative stuff we're working on at Solo. So thank you so much, don't hesitate to reach out, and I look forward to the rest of the track here at ServiceMeshCon. Thanks.