Hey everyone, and thanks a lot for joining me in this session today. I'm Niranjan. I'm a software engineer on the Azure Kubernetes Service team at Microsoft. I'm currently working on the Istio add-on for AKS. Previously, I was on the Container Upstream team working on the Open Service Mesh project, so I have experience working on both managed and open source service meshes. One of my focuses more recently, working on managed Istio, has been deciding which features we want to incorporate into our add-on, which features we want to disallow, and how to go about doing so. So in this presentation, I wanted to share some of those lessons to help you untangle your service mesh with feature gates.

First, I'm going to touch on the general problem of service mesh complexity and why users often end up with what I would call a tangled service mesh. Then I'm going to explain how to use quote-unquote feature gates to help solve this problem. Then I'm going to discuss some helpful criteria for deciding what kinds of features or configurations to allow or disallow in your environments. Finally, I'll conclude by reiterating some of the general takeaways from all this and highlighting some recent developments to watch out for.

So over the past few years, service meshes have amassed a wide array of features and configuration options. A lot of this has to do with the fact that they're built on top of the Envoy proxy with the sidecar model, and Envoy is very powerful, with a lot of advanced capabilities that operators want to leverage. The obvious exception here is Linkerd, which has its own Rust-based proxy. And as you may have heard, if you attended the Battle Scars discussion on Tuesday, we also have eBPF and sidecar-free service meshes, although not all of the sidecar-less models are production-ready just yet, like Istio ambient. And not only can there be so much to configure, but there are often multiple ways of configuring a service mesh: we have installation values, annotations, CRDs, and so on. The example on the top right there is how you define the mesh-wide configuration for a Consul service mesh, and below that is an example of how you would configure the Linkerd proxy with resource annotations. And for a service mesh like Istio, a lot of these configuration pathways can overlap; I'm going to be giving a few examples of those later in the presentation.

Admittedly, there are clear upsides to having a lot of features. Applications on distributed platforms like Kubernetes have varying and inherently complex networking and security needs, so service mesh functionality inevitably needs to be at least somewhat broad in order to meet those needs. Christian Posta actually had a great talk about this at IstioCon earlier this year, called debunking the "Istio is complex" meme; I highly recommend giving that a watch at some point if you get a chance. Also, having a lot of configuration toggles gives us flexibility in tailoring the behavior of the mesh to our environment. For instance, we might want some settings to apply on a global level for the mesh, others to apply to particular namespaces, and other policies to target individual workloads. But needless to say, there are some major downsides to having a broad feature set. Service mesh has gotten a reputation for being very complex, despite some important and ongoing efforts in the community to simplify operations.
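(To make those two slide examples concrete, here's a rough sketch of each configuration surface. This is illustrative, not the actual slide content: the ProxyDefaults kind and the config.linkerd.io annotations are the real mechanisms, but the resource names and values here are made up.)

```yaml
# Consul: mesh-wide proxy defaults, defined as a custom resource
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global              # "global" is the required name for mesh-wide defaults
spec:
  config:
    protocol: http
---
# Linkerd: per-workload proxy tuning via resource annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                 # hypothetical workload
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
      annotations:
        config.linkerd.io/proxy-cpu-request: "100m"
        config.linkerd.io/proxy-memory-limit: "250Mi"
    spec:
      containers:
        - name: web
          image: nginx      # placeholder image
```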
And there's a steep learning curve to learn all the CRDs and APIs and all of the fine-tuning options. And while having an expansive feature set helps in terms of appealing to a wide array of users, each organization typically needs just a subset of service mesh features, not all of them. Because of this complexity, and because of having multiple sources of truth, users can end up with what I would call a tangled service mesh, where there are traffic and security misconfigurations and mismatches, and you need to spend hours and hours debugging. The backend developers could have introduced experimental or alpha features into production without the knowledge, or against the wishes, of the platform administrators. Resource utilization could be blowing up without a clear reason as to why. And earlier, I mentioned being able to configure settings at a mesh-wide, namespace, or workload level, but in some cases, the admins wouldn't necessarily want developers to be able to override certain mesh-wide settings in that manner. And due to situations like that, we can end up with friction between the platform engineers and the service owners.

So the solution to this problem is to quote-unquote feature gate our service mesh. I'll unpack what I mean by this and why I'm using quotes here. Typically, feature gates or toggles are viewed as on-off switches to enable or disable the execution of certain code paths. Kubernetes itself uses feature flags in this manner for several features, though there is an ongoing proposal by Tim Hockin to modify this process. Istio also provides ways of enabling or disabling features through installation values or environment variables for the control plane. For example, up here, I have the environment variable that enables Istio to consume Gateway API CRDs that are currently in alpha, and there is the encapsulated code statement that the flag controls. Likewise, for OSM, Open Service Mesh, we used installation values and the OSM MeshConfig to provide a way to enable or disable certain features.

However, these kinds of toggles aren't necessarily enough. For one thing, they're usually just tied to a feature's stage in the lifecycle, like whether it's alpha or beta. But as I'll explain later, we might have other criteria beyond feature status itself that entail disallowing certain features or configurations in production. Also, it's not always reliable or practical for the project itself to use feature flags in this manner. For example, just based on what I understand from the Istio maintainers, feature flagging in Istio has been more ad hoc and not necessarily consistently implemented across APIs. It's also worth noting that if a particular feature cuts across multiple layers of the code stack, it can be very difficult to track and control the execution of all those code pathways. And that's especially true for something like service mesh, where certain features can be very expansive and complex in scope. And as I was highlighting earlier, it's not just the feature set but also the configurability that can be an issue in terms of service mesh complexity. So to fully untangle our service mesh, we need to broaden our implementation of feature gating to include some other types of techniques and restrictions: for instance, limiting configurability and reverting configuration drift. An admin may also want to hide some of the error-prone configuration from the developers, or disallow certain custom resources, annotations, or fields in the API specs.
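(As a concrete reference point for those simple on/off control-plane flags, here's roughly what toggling the alpha Gateway API support looks like when installing Istio. This is a sketch rather than the slide itself: PILOT_ENABLE_ALPHA_GATEWAY_API is a real istiod environment variable, but whether it matches the slide exactly is my assumption, and the manifest around it is illustrative.)

```yaml
# Sketch: enabling an alpha feature flag on the Istio control plane
# through an IstioOperator overlay, e.g. with `istioctl install -f <file>`
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  values:
    pilot:
      env:
        PILOT_ENABLE_ALPHA_GATEWAY_API: "true"  # let istiod reconcile alpha Gateway API CRDs
```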
And in order to put these guardrails in place and ensure that they're not being bypassed, we're also going to be delving into the realm of runtime policy enforcement, which allows us to be fine-grained in the validations we're defining. Typically, these restrictions and policies would be put in place by the mesh administrators on a platform engineering team, or, depending on your organization, you might call it an ops or an infra team. In terms of why an admin would go about doing this, there are a few reasons. One key motivation is to reduce operational complexity and make it easier for the backend developers to onboard services onto the mesh. By decluttering the environment of unwanted features and configurations and limiting it to what's absolutely necessary, you limit the possibilities of misconfiguration. An admin may also want to use admission controllers or policy enforcement engines to ensure that the desired mesh-wide settings and constraints are not being circumvented, and to reinforce administrative control. And finally, as I'll explain later, some of these validations can be tailored towards mitigating resource consumption.

So there are several items that mesh operators have in their toolkit for implementing these kinds of feature and configuration gates. I'm going to go through these techniques individually and provide some examples. Feel free to check out my GitHub repository that I've linked in the slide footers for some of these examples, or download the slides and click on some of the hyperlinks. I'm going to be using Istio, by the way, but I've also included some guides and integrations for Linkerd for all the potential Linkerd users in the audience.

So let's start off talking about admission controllers, with just a brief overview of how this works in Kubernetes. When you create a resource, the manifest gets subjected to certain mutations and validations before being persisted to etcd. So the idea here is to use a validating webhook to verify service mesh custom resources. One great solution we can leverage to perform these validations is Gatekeeper, which is an admission controller for Kubernetes that enforces policies through the Open Policy Agent. The examples I'm going to be giving use Gatekeeper constraints written in Rego. However, you could try using other controllers like Kyverno, or, if you're feeling adventurous, maybe try writing some policies in Common Expression Language with Kubernetes 1.28. For those of you who maybe aren't familiar with Gatekeeper, the constraint template is used to define the policy violation, and the constraint tells Gatekeeper which resources the template's policies should be applied to.

So one example of a custom resource we would want to subject to these validations is the Istio PeerAuthentication. The PeerAuthentication resource I have up here, deployed in the istio-system root namespace, enforces that workloads across the mesh only communicate through mTLS. And this is typically good practice after having onboarded all of your applications to the mesh, to ensure that they're no longer accepting plaintext traffic. However, it is possible to bypass this global strict mTLS setting at a namespace or workload level. If you take a look at the policy precedence statement here, a namespace PeerAuthentication takes precedence over the mesh-wide setting, and a PeerAuthentication with a workload selector takes precedence over both the namespace and global settings.
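(To make that precedence chain concrete, here's a minimal sketch of both sides: the mesh-wide strict policy, and a hypothetical workload-scoped policy that would silently override it. The names and namespaces are illustrative.)

```yaml
# Mesh-wide policy: deployed in the root namespace, so it applies everywhere
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT            # all workloads must speak mTLS
---
# Hypothetical workload-scoped override: because it has a selector,
# it takes precedence over both the namespace-level and mesh-wide
# policies, re-opening plaintext traffic for this one workload.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: legacy-override
  namespace: payments
spec:
  selector:
    matchLabels:
      app: legacy-billing
  mtls:
    mode: DISABLE
```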
So a policy we could write in our Gatekeeper constraint template is to deny PeerAuthentications that could disable global mTLS. The allowed modes in this policy here are STRICT and UNSET. If a PeerAuthentication tries setting the mode to DISABLE, it will be blocked by Gatekeeper.

Another potential thing a platform administrator may want to watch out for is overriding the global configuration of the proxy. One way of configuring the proxy in Istio is through the Istio mesh config. And usually, because the mesh config handles the global settings of the mesh, it would typically be handled by the cluster or platform administrator. But as you can see in this bottom snippet, we can also apply a resource-level annotation to our pods to customize the proxy settings for that workload. And on top of that, we have the ProxyConfig custom resource. The way the policy precedence works here is that the ProxyConfig custom resource takes precedence over the annotation, and both the annotation and the custom resource take precedence over the mesh config. So if an administrator wanted to ensure that the mesh-wide proxy settings weren't being bypassed, they could potentially write the following Gatekeeper policies. On the top there, we are blocking pods that attempt to use the proxy config annotation, and the template at the bottom denies the ProxyConfig custom resource altogether. So what we've effectively done here is limit our proxy configuration to one source of truth, and thereby eliminated the possibility that it could be bypassed in this manner. However, keep in mind that this is just one potential operational pattern; in some environments, you might need the flexibility to configure the proxy at a namespace or workload level.

On top of these examples, some other admission control policies we could write would be to enforce that sidecar injection takes place, for instance by disallowing pods that attempt to bypass sidecar injection with the sidecar inject label. Also, when developers are exposing services through gateways or setting routing rules for workloads in the mesh, it's always good practice to be explicit in defining the hosts, so a validation an admin could add would be to disallow overly permissive configurations that attempt to use wildcards. And as I'll unpack a bit more later, we can also block experimental, alpha, or deprecated custom resources.

I'd also point out that we could take a shift-left approach and do some of these validations at the CI level. For instance, in our CI/CD workflow, we can use a CLI tool called Gator, which is specifically designed to verify resources against Gatekeeper constraints and templates. There's also Conftest, which is another Rego-based CI linter. The advantage of these tools is that we can use the same policies we've defined for Gatekeeper to validate our custom resources before they're deployed onto Kubernetes, and then we have Gatekeeper in Kubernetes as a last line of defense.
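(Here's a minimal sketch of what that PeerAuthentication constraint could look like, usable both in-cluster with Gatekeeper and in CI with Gator. It's my approximation rather than the exact policy from the slides, and a production version would also need to handle the per-port mTLS settings.)

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: allowedpeerauthmodes
spec:
  crd:
    spec:
      names:
        kind: AllowedPeerAuthModes
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowedModes:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package allowedpeerauthmodes

        violation[{"msg": msg}] {
          # default to "UNSET" when spec.mtls.mode is absent
          mode := object.get(input.review.object.spec, ["mtls", "mode"], "UNSET")
          not allowed(mode)
          msg := sprintf("PeerAuthentication mTLS mode %v is not allowed", [mode])
        }

        allowed(mode) {
          mode == input.parameters.allowedModes[_]
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AllowedPeerAuthModes
metadata:
  name: enforce-strict-mtls
spec:
  match:
    kinds:
      - apiGroups: ["security.istio.io"]
        kinds: ["PeerAuthentication"]
  parameters:
    allowedModes: ["STRICT", "UNSET"]
```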
The next approach to feature gating and configuration gating that I wanted to discuss is making API abstractions on top of the service mesh APIs. I like this example here of taping up a TV remote for, maybe, a family member who's been struggling to use it. By making an API abstraction, I would say a platform engineer is essentially doing the same thing for the backend service owners. They're masking all of the irrelevant configuration options and only exposing what is absolutely necessary for their specific use cases. So the service owners would just work through a higher-level API, and then there would be some CI/CD workflow or automated process to convert these to service mesh custom resources. Even though this isn't an explicit mechanism for blocking features or using feature flags to disable the execution of code paths, the idea is essentially the same: the service owners are blocked from using anything except the specific set of features and configuration toggles that have been exposed and permitted for them. For instance, here's what Salesforce does for their abstraction layer. They deploy Istio resources as Helm charts, and all the developers need to worry about is specifying the values.yaml. So we see how the complex AuthorizationPolicy custom resource in Istio gets greatly simplified to a narrow subset of fields. Another example I found pretty cool was based on an IstioCon talk from earlier this year: Intuit actually uses a declarative UI to configure weighted traffic routing. The developers would use this internal development platform to specify the traffic-shifting weights, and the platform would create the corresponding VirtualService with those weights.

The next approach I wanted to highlight is to use GitOps. Frameworks like Flux and Argo are very popular now amongst platform engineers for automating installation and upgrades of infrastructure on Kubernetes. But when we deploy our service mesh components and custom resources through GitOps, we can define the configuration of our mesh declaratively as well; this is what is known as configuration as code. The benefit of this approach is that it allows us to effectively configuration-gate the fields and values that we have defined declaratively. Because the GitOps controller is continuously monitoring the desired configuration of the service mesh, it prevents configuration drift by reconciling undesired changes to these resources. In this example here, I have an Argo CD manifest that's installing the Istio control plane through a Helm chart. Notice in the bottom few lines, I'm also specifying the desired state of the Istio mesh config by providing these values in the Argo deployment directly. So when Argo pulls the Istio Helm chart, it will pass these values directly to the Istio mesh config upon installation. Now Argo can monitor changes to the mesh config values that we defined in the previous step and point out whether the current state has deviated from the desired state; you can see the actual state on the left and the desired state on the right. For instance, perhaps another mesh admin was experimenting with some fields and forgot to set them back. Or maybe they were trying to mitigate Envoy's resource consumption, but didn't realize that setting the concurrency to zero causes Envoy to use up all of the CPU cores, as opposed to leaving it unset. Thankfully, tracking configuration options through GitOps like this helps us narrow down errors like these and reconcile them back to our desired state.
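(Here's a minimal sketch of that kind of Argo CD Application, not the exact manifest from the slide. The chart version and Helm values are illustrative; the key idea is that the mesh config values live in Git, and selfHeal reverts out-of-band edits.)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istiod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://istio-release.storage.googleapis.com/charts  # official Istio Helm repo
    chart: istiod
    targetRevision: 1.19.0            # illustrative version
    helm:
      values: |
        meshConfig:                   # desired mesh-wide configuration, tracked in Git
          accessLogFile: /dev/stdout
          defaultConfig:
            concurrency: 2            # pinned explicitly so drift shows up in the Argo diff
  destination:
    server: https://kubernetes.default.svc
    namespace: istio-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true                  # reconcile manual changes back to the Git-defined state
```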
The last option we have at our disposal is role-based access control in Kubernetes. Obviously, this is less fine-grained than some of the other feature gating mechanisms and policy enforcements that we've covered. However, it is generally considered good practice to restrict access to the control plane and the istio-system, istio-ingress, and istio-egress namespaces to the mesh admins or experts. And we may also want to do this for specific features and custom resources as well.

So now that we've looked into some techniques for gating service mesh features and configurations, let's explore some potential criteria to help you establish what kinds of features you want to allow or disallow in your environments. Some of these criteria could be overlapping, and some functionalities can touch on several of these areas.

I'd say the first thing to evaluate is the operational complexity of the feature. How easy is it for you as an admin to understand, and how complex would it be for the development teams to use? Is there enough documentation or support being offered by the community and the project maintainers? And are there multiple ways of configuring it? If so, as I touched on earlier with the case of the proxy configuration, it's much more intuitive if you can simplify it down to one source of truth. One example of a feature that's widely regarded as complex is Istio's EnvoyFilter. Because EnvoyFilters allow us to directly modify the Envoy configuration, they can be very convoluted and difficult to use, and if misconfigured, they can break traffic across the entire mesh. And as I'll touch on in a bit, the feature status of EnvoyFilter is alpha, which may be another cause for concern. However, this hasn't necessarily stopped operators from using EnvoyFilters quite extensively, whether the use case is rate limiting, injecting custom Lua scripts, extending Envoy with WebAssembly, and so on. So one potential solution here would be for an admin to allow EnvoyFilters in a more limited capacity. For instance, if you only need them for global or local rate limiting, you could have an admission controller policy that only allows EnvoyFilters of that specific type. On top of that, we can have an API abstraction layer over the rate-limiting EnvoyFilters for the service owners. Or perhaps we just restrict the use of EnvoyFilters to mesh admins and experts altogether, using RBAC or admission control. In this example, I have a Gatekeeper constraint that rejects EnvoyFilters that are not of the rate-limiting type; you can see that in the filter type variable there that I've outlined. And on top of that, I've created an example here of what an abstraction layer with Helm could look like. Don't worry about the individual values there per se; the point is that I've simplified the rate-limiting EnvoyFilter down to a simple values.yaml, limited to just what the developers would need. And because of the way I've set up the Helm charts and hard-coded the type in the manifests, rendering the templates will only ever produce rate-limiting EnvoyFilters. So the developers don't really have any control over what type of EnvoyFilter gets created, or over any other fields in the EnvoyFilter spec that haven't been exposed to them. And by the way, though Helm works for this specific example, if you need something more complex, you'll probably need to write your own CRDs for the EnvoyFilter. Here, I've linked to an example of a CRD that GoPay has built on top of the rate-limiting EnvoyFilter, so again, feel free to download the slides and check that out later.
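(Here's a rough sketch of such a constraint, my own approximation rather than the slide's exact Rego. It only inspects the filter name in each config patch; a production policy would also need to handle patches that don't set a name, typed_config-only patches, and so on.)

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: ratelimitenvoyfilteronly
spec:
  crd:
    spec:
      names:
        kind: RateLimitEnvoyFilterOnly
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package ratelimitenvoyfilteronly

        # Only the Envoy rate-limiting filters are permitted
        allowed_filters := {
          "envoy.filters.http.ratelimit",
          "envoy.filters.http.local_ratelimit",
        }

        violation[{"msg": msg}] {
          patch := input.review.object.spec.configPatches[_]
          name := patch.patch.value.name
          not allowed_filters[name]
          msg := sprintf("EnvoyFilter with filter %v is not allowed; only rate limiting is permitted", [name])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: RateLimitEnvoyFilterOnly
metadata:
  name: restrict-envoyfilters-to-rate-limiting
spec:
  match:
    kinds:
      - apiGroups: ["networking.istio.io"]
        kinds: ["EnvoyFilter"]
```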
Another important thing to consider when establishing your feature gates is the status of the feature. Many organizations have requirements that only beta or stable features, or often just stable ones, can be used in production. So it's good practice to disallow features or APIs that haven't reached that stage in the feature lifecycle yet. For example, customizable telemetry through the Telemetry API in Istio is currently still in alpha. So until that feature has reached beta or stable status, you may want to block use of the Telemetry API and opt for configuring telemetry through the Istio mesh config instead.

It's also crucial to factor in the risk level and the impact on the overall security of the mesh. Because capabilities like mTLS and zero trust are such core reasons why organizations adopt a service mesh in the first place, the resources and APIs that govern security should be handled carefully. We've already touched on PeerAuthentication and mTLS, but there are some other Istio resources here that pertain to security. To prevent policy misconfigurations and mismatches, it would definitely be worthwhile to add some fine-grained validations for these resources specifically, hide some of the configuration behind an abstraction, or delegate their configuration to the admins or experts altogether.

Different features and configuration options also have important implications for resource consumption. Earlier, I was highlighting how misconfiguring the concurrency field can cause Envoy to use more CPU cores than intended. Another way of preventing a misconfiguration like that, in addition to tracking those values through GitOps, could be establishing an upper and lower bound for those fields in our validating webhook. Another helpful policy enforcement would be to ensure the existence of a Sidecar resource like this one in the istio-system namespace. The Sidecar custom resource in Istio can be used to limit the scope of the Envoy config. For instance, the one here says that every proxy in the mesh should only be aware of other workloads in the same namespace or in the istio-system namespace, as opposed to the default behavior, where Istio pushes information about every service in the cluster to Envoy. Limiting the scope of the proxy through the Sidecar resource has been shown to significantly mitigate the memory consumption of Envoy. And after ensuring the existence of a mesh-wide Sidecar resource, we can define a policy like this one, which ensures that it's not overridden at a namespace or workload level. So the policy I have here verifies that no Sidecar resource can bypass the mesh-wide constraint by using a wildcard for the egress hosts.
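(The mesh-wide Sidecar resource described above looks roughly like this; it mirrors the well-known example from the Istio docs.)

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: istio-system    # root namespace, so this is the mesh-wide default
spec:
  egress:
    - hosts:
        - "./*"              # only services in the workload's own namespace...
        - "istio-system/*"   # ...plus the control plane namespace
```

A namespace-level Sidecar that sets the hosts back to "*/*" would widen that scope out again, which is exactly the kind of override the wildcard-denying policy guards against.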
We've also seen some cases where Wasm-based filters in Envoy can cause resource utilization to spike, because of the way Envoy spins up a VM for the Wasm module in each worker thread. With OSM, for instance, we saw that when users had permissive traffic policy enabled, meaning that OSM would populate the Envoy config with all of the services in the mesh, and had Wasm-based telemetry enabled on top of that, their data plane resource utilization would sometimes spiral out of control. So a good policy enforcement mechanism there could be for an admin to prevent both of those features from being enabled at the same time. It's also worth noting that Wasm-based telemetry and WasmPlugins in Istio are experimental or alpha, which is another reason why they should be used judiciously or restrictively.

And once again, we return to the problem of annotations. Even if the operators configure a global setting for resource quotas for all the proxies in the mesh, it can be bypassed by using the sidecar annotations on pods. So the admin could define Gatekeeper constraints, as demonstrated up here, to deny pods or deployments that attempt to use the proxy CPU or proxy memory annotations. The constraint on the left defines the disallowed annotations, and the constraint template on the right defines the violation for those annotations.
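(A minimal sketch of that pair could look like the following. The annotation keys are the real Istio proxy resource-override annotations, while the template itself is my approximation of what was on the slide.)

```yaml
# The constraint: lists the disallowed annotations
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: DisallowedAnnotations
metadata:
  name: deny-proxy-resource-overrides
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    disallowedAnnotations:
      - sidecar.istio.io/proxyCPU
      - sidecar.istio.io/proxyMemory
---
# The constraint template: defines the violation
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: disallowedannotations
spec:
  crd:
    spec:
      names:
        kind: DisallowedAnnotations
      validation:
        openAPIV3Schema:
          type: object
          properties:
            disallowedAnnotations:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package disallowedannotations

        violation[{"msg": msg}] {
          key := input.parameters.disallowedAnnotations[_]
          input.review.object.metadata.annotations[key]
          msg := sprintf("annotation %v is not allowed: proxy resources are managed mesh-wide", [key])
        }
```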
And finally, the most important thing to consider when defining your allow list is your organization's specific needs and use cases: reflecting on why you adopted a service mesh in the first place, and what the minimum set of features and capabilities is that you need to accomplish those aims. The key to success here is iterative adoption. Let's say you were using Istio: maybe just start off using VirtualServices and DestinationRules, exposing the minimum number of fields to the service owners, just to get basic traffic management and mTLS going. Then, after a certain point, as you onboard more services onto the mesh, build confidence with the starter kit, and your use cases evolve and become more complex, you can start expanding your allow list to more features and exposing more fields in those APIs.

So just to summarize what we've covered in this session: we first saw some techniques you can use to restrict the feature set and configurability of your service mesh. And here's an example of what this could potentially look like end to end, with all these methods put together. You have your application development team working against an abstraction layer over Istio custom resources, and a CI/CD workflow to convert these to Istio custom resources, which are then validated by the CI linters; the pull request gets merged, and then Argo pulls the new applications and the new manifests into Kubernetes. We also have Gatekeeper deployed alongside Istio to enforce the various policy constraints and gating mechanisms that we've looked at throughout the presentation. And again, Argo is not only pulling the manifests and the Helm charts, it's also monitoring configuration drift and reverting undesired configurations back to the desired state we've defined in Git. The platform engineers here would be responsible for creating the abstraction layer and the CI/CD workflows, and they would also be managing the infrastructure on Kubernetes and the corresponding policies and configurations for Istio and Gatekeeper. Again, this particular operational pattern is just based on a setup I was exploring; you don't necessarily have to use Helm charts or any of the other specific frameworks that I'm using, it's just a rough guideline. Actually, Mitch Connors has some great demos with Istio, Flux, Flagger, and Argo, including with ambient mesh, so definitely check those out as well. And then we touched on some criteria, such as complexity, feature status, security and risk level, resource consumption, and organizational requirements, to help you build your allow list for your desired subset of service mesh features and configurations.

Admittedly, one major motivation for this process is mitigating operational complexity, but there is still going to be complexity involved in the feature gating process itself. The mesh admins still need to do their homework: understand all the various APIs and all the fine-tuning options, and then evaluate them against the criteria I was mentioning. And on top of that, we're introducing new tools like Gatekeeper, Argo, and Flux, which have their own CRDs and languages to learn. However, I would argue that the benefits of going through this process and putting time and thought into it certainly outweigh the drawbacks. At the end of the day, your organization ends up with an untangled, decluttered service mesh environment, and your service mesh is easier to operate and less error-prone. As a result, we have better harmony between the platform engineers and the developers, because we have a declarative framework in place to enforce what kinds of features and configurations we want to allow or disallow in our production environments. And most importantly, this allow list has been tailored to your organization's specific set of requirements.

Finally, I wanted to touch on some relevant developments in the ecosystem, specifically from the Istio community, to keep an eye out for. One of the key things worth highlighting here is the ongoing work being done to improve the feature graduation process in Istio. Currently, there are a lot of features that have been lingering in alpha status, but the community has been prioritizing testing and enhancing them to help them reach a graduated status faster. So as more features are brought out of alpha purgatory and into beta or stable, you can incorporate them into your production environments with more confidence. There's also a general consensus in the Istio community now that, in most cases, it's preferable to break complex APIs like EnvoyFilter down into smaller first-class CRDs, and in general, to minimize overlap across APIs and narrow down the set of functionalities and fine-tuning options to one source of truth. With improvements like these, the process of configuring and navigating Istio will be much more intuitive down the road. And that will be even more true with the evolution of Istio ambient mesh, due to the separation of the layer 4 and layer 7 components, policy enforcement, and configuration, as well as the fact that your mesh is just going to have fewer moving parts in general.

If anyone wants to download the slides and look into some of these examples later, I have them linked here. I also gave a similar talk about this at IstioCon earlier this year, so feel free to check that out as well. So that is it for this presentation. Thank you so much, everyone, for coming. I hope you found it valuable and can take something away from it. Please scan the QR code to provide feedback, and feel free to reach out on LinkedIn; I'm happy to talk anything service mesh, GitOps, or Kubernetes, or to discuss service mesh operational patterns. There's my GitHub repository and LinkedIn. So yeah, thanks again, and please enjoy the last few hours of KubeCon. I'll be hanging around for a bit if anyone wants to ask any questions.

I had a question.
So one of the things I was thinking to myself as you were giving the presentation is that there are a lot of places where you can define things, and a lot of opportunities for misconfigurations to happen, even just within your platform team, if you're defining the rules in three different places: to catch it in CI, to catch it in production. Do you have any recommendations on how you might think about an abstraction layer above that, to say, these are the rules we want to enforce, and to make sure they're applied in the right places? Or has that not been something that anyone has thought through?

So if I understood your question, it was how to do some of these validations even earlier, like through the abstraction or internal development platform? Yeah, I guess it depends on how you build your abstraction layer and how the platform itself is set up. With the Intuit example I was giving, because they have a declarative UI, a web app, I think in that case it would probably be easier to bake some of the validations into the abstraction platform itself. With the Helm examples and the other examples I was giving, the abstraction itself probably wouldn't be able to block or disallow things; you'd probably need to do that at the CI level.

Thanks, Niranjan, this was great. You sort of had two categories of things that we talked about. One was templating and providing just limited levers for an app dev to pull in terms of configuring their Istio. And the other was actually rejecting config that violates your rules, sort of administrative or policy-based. Are those things we should be choosing between, either templating or policy, or is it a both-and thing that you would recommend?

In my own opinion, and it's harder for me to say because I'm not an organization actually using Istio in production, I'm more of a developer, but if I had to make a recommendation, it would be that it's better to have multiple layers of defense. So you have your abstraction layer, and then you have your policy templates on top of that as kind of a last line of defense. But again, it is a lot of work to do all of them. You're using Istio, which is complex enough, then you have Gatekeeper, which is another layer of complexity, and then you have to develop your API abstractions, which is a third layer. So it depends on the organization and the specific needs and requirements of the platform engineering team, but yeah, I would recommend doing as many validations as you can beforehand.

Thank you. With that, I'm just gonna turn off my mic, but I'll be hanging around for more questions.