Hey everyone, and thanks so much for attending this session today. I'm Naranjan. I'm a software engineer on the Azure Kubernetes Service team at Microsoft. I work on the Istio add-on for AKS, and I've also made some contributions to the Istio code base and documentation. A lot of my recent work has focused on deciding which Istio features to incorporate into our add-on, which features to block or exclude, and how to go about doing so. So in this presentation, I wanted to share some of those lessons with you to help you untangle your Istio mesh and make Istio more manageable and secure with feature gates.

Just to give an overview of what I plan on covering here: first, I wanted to talk about the general problem we're trying to solve and why feature gates are the solution to that problem. Then I wanted to cover how you could go about feature gating your environment and establishing these guardrails, and also touch on some criteria to help you decide which features to allow or disallow in your environments. And finally, to conclude, I'll reiterate the primary takeaways and touch on some relevant developments in the Istio community related to this.

So as I'm sure most of you are already aware, Istio has a lot of features. And not only are there so many features, but there are so many ways of configuring them. Here's what I mean. Take plug-in CA certs: I can pass these in through the mesh config's caCertificates field, or by populating the cacerts Kubernetes secret in the root namespace. Another case in point is setting environment variables for the proxy. We can do this through the mesh config, but we also have resource-level annotations as an option, or we can use the ProxyConfig custom resource. And these are just some of numerous examples of how there can be multiple ways of accomplishing a given task in Istio.

There are definitely some clear benefits and upsides to this broad feature set and configurability. For one thing, it makes Istio versatile and adaptable to various cloud environments. Organizations using distributed platforms like Kubernetes have vast and complex needs, and Istio's broad feature set enables it to meet them. Having multiple avenues of configuration also gives operators flexibility in how they fine-tune their mesh. In some cases, we want certain settings configured at the mesh-wide level, but use custom resources or annotations to tweak settings at a more granular namespace or workload level.

However, there are some clear drawbacks to this. With this growing list of features and the numerous ways of configuring them, we've heard a lot of complaints about Istio's operational complexity, despite some very important and ongoing efforts in the community to simplify Istio. Part of the reason for this is that while the broad array of features makes Istio appealing to a broad array of users, each user typically needs only a subset of Istio's features, not all of them. There's also a steep learning curve to becoming familiar with the Istio APIs. And with complexity comes the possibility of misconfigurations and policy mismatches, which can break traffic or leave workloads in our mesh vulnerable to attack. Previously I mentioned being able to configure fields at a mesh-wide, namespace, or workload level, but there are cases where a platform engineer doesn't want this: they want certain mesh-wide constraints, and they don't want those constraints to be circumvented at a lower level.
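To make that concrete, here's a rough sketch of the proxy environment variable example from earlier. The specific variable and workload names are just illustrative stand-ins; the point is that the same proxy setting can come from the mesh-wide config or be overridden per pod:

```yaml
# Mesh-wide: set a proxy environment variable for every sidecar
# via the mesh config's defaultConfig.
meshConfig:
  defaultConfig:
    proxyMetadata:
      ISTIO_META_DNS_CAPTURE: "true"
---
# Per workload: the same knob can be overridden with a pod annotation,
# which is exactly the kind of lower-level circumvention a platform
# engineer might want to guard against.
apiVersion: v1
kind: Pod
metadata:
  name: my-app            # hypothetical workload
  namespace: foo
  annotations:
    proxy.istio.io/config: |
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "false"
spec:
  containers:
    - name: my-app
      image: my-app:latest
```

A mesh admin who wants the mesh-wide value to be authoritative would need a guardrail that flags or rejects the per-pod annotation.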
And because of disagreements like this, we can end up with bottlenecks between platform engineers and service owners.

So the solution to this is to feature gate our Istio environment. Typically, when we talk about feature gates in Kubernetes, they refer to key-value pairs that we can use to turn specific features on or off. Istio does offer something similar with pilot environment variables and other installation settings. However, as per the Istio maintainers, this isn't necessarily the recommended route for enabling or disabling features in Istio, though I believe there is some work to improve this long-term. Also, operators might want to look beyond toggling features per se and add additional guardrails, say related to custom resources, specific configuration options, or external policy enforcement mechanisms. So just to clarify: when I use the phrase "feature gating" for the rest of the presentation, I'm not just talking about enabling and disabling features per se, but also about establishing boundaries on configuration options and custom resources.

Feature gates are often talked about in the context of, say, cloud providers offering managed Istio, who establish guardrails in the mesh beforehand for their users. But this isn't always the case, right? We could have a mesh administrator using open-source Istio who wants to establish their own guardrails, for instance to limit operational complexity, enforce administrative control, or prevent some common misconfigurations and policy mismatches that could compromise the mesh. They might also want to ensure that certain best practices regarding security, observability, or resource consumption are being adhered to.

We have several items in our toolkit for establishing these guardrails in our mesh environment. We could use Kubernetes admission controllers, or even take a shift-left approach and do some of these validations at the CI level. We have role-based access control to restrict capabilities to cluster admins. We could use GitOps tools to prevent configuration drift and enforce configuration specifications for our mesh through configuration as code. And we could develop API abstraction layers on top of Istio custom resources to selectively expose certain fields in the Istio APIs to developers.

So let's start off talking about how we could accomplish this through admission controllers. Just a brief overview of how this works in Kubernetes: when you publish a manifest to the API server, it's subjected to certain mutations and validations before being persisted to etcd. So the idea here is to add validations for Istio custom resources. Istio does offer its own validating webhook server, but this mainly does broader, higher-level validations to make sure that the custom resources are roughly defined correctly and there aren't any blatant misconfigurations. I've provided an example in the slide here of what the configuration for the validating webhook looks like. But we might need to add some additional validations on top of these: other verifications for custom resources, perhaps even blocking certain custom resources altogether. There may be situations that require validating the mesh config or the global proxy config. And there are cases where resources can, through annotations, bypass desired mesh-wide constraints or even circumvent sidecar injection, so we want to validate those manifests as well.
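As a sketch of the injection bypass case (the workload name here is hypothetical), a single pod label is enough to opt a workload out of the mesh, and it's exactly the kind of thing an admission policy could flag or reject in meshed namespaces:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # hypothetical workload
  namespace: foo            # assume this namespace is labeled for injection
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
        # Opts this pod out of sidecar injection, silently removing
        # it from mTLS and policy enforcement.
        sidecar.istio.io/inject: "false"
    spec:
      containers:
        - name: my-app
          image: my-app:latest
```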
And there are some broader best practices we should implement with respect to pod privileges and resource quotas to safeguard the overall security and behavior of the system.

As an example of a custom resource that might require validation, here's the PeerAuthentication resource. Here I have a PeerAuthentication resource that sets global mTLS to STRICT by applying it in the root namespace. Usually operators would do this after having migrated all of their workloads to the mesh, to ensure that no pods are accepting plain-text traffic anymore. However, it's possible to bypass this global enforcement of mTLS. If you look at the policy precedence here: a PeerAuthentication defined at the workload-specific level takes precedence over a namespace-scoped PeerAuthentication policy, which in turn takes precedence over the global PeerAuthentication policy. So an administrator could enforce through admission control that no PeerAuthentication resource bypasses this global mTLS setting by changing the mTLS mode to PERMISSIVE.

Relatedly, another validation we could add is for the DestinationRule. Destination rules control whether the traffic emitted by the proxy is plain text or encrypted, so a validation we could add here is to ensure that destination rules don't override the default auto-mTLS setting by disabling mTLS or setting the TLS mode to SIMPLE.

It's also a good practice to enforce the existence of a deny-by-default authorization policy, such as this one, in the root namespace. That way, the service owners and/or the mesh operators need to apply authorization policies individually to enable workload-to-workload communication in the mesh.

Besides custom resources, another thing to potentially validate is the Istio mesh config. Because the mesh config typically controls mesh-wide settings, it would usually be handled by the platform engineers or cluster administrators. But it might still be worth restricting the configurability of the mesh config, say if operators wanted to establish some guardrails among themselves or set upper or lower limits on specific fields. In some cases, the mesh experts or administrators might need to open up a subset of certain mesh config fields to developers or to certain development teams.

Another thing to consider for the mesh config is that certain mesh-wide settings we enforce could potentially be circumvented through custom resources or through annotations and labels. Let's say, for instance, we have an organization and one of their requirements is that all traffic is logged with Envoy access logging and forwarded to a log analytics workspace. One way of configuring this mesh-wide is through the accessLogFile field in the mesh config. But as you can see in this example here, I have a Telemetry API resource that specifically disables access logging in the foo namespace. The mesh admin in this case would want to prevent developers from being able to circumvent the desired behavior in this manner.

So one great solution for defining these policies is Gatekeeper, which is a Kubernetes admission controller that enforces policies through Open Policy Agent. You write these policies in a policy language called Rego, and the examples I've linked here provide some real-world examples of how Istio custom resource policies can be defined in Rego for Gatekeeper.
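To give a flavor of what one of these looks like, here's a minimal sketch of a Gatekeeper ConstraintTemplate that rejects any PeerAuthentication weakening mTLS. The template name and message are my own hypothetical choices, and a matching Constraint would scope it to PeerAuthentication resources via its match.kinds field:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: istiostrictmtls            # hypothetical name
spec:
  crd:
    spec:
      names:
        kind: IstioStrictMtls
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package istiostrictmtls

        # Reject any PeerAuthentication whose mTLS mode is not STRICT.
        # The Constraint using this template would be scoped to
        # PeerAuthentication resources via match.kinds.
        violation[{"msg": msg}] {
          mode := input.review.object.spec.mtls.mode
          mode != "STRICT"
          msg := sprintf("PeerAuthentication mTLS mode %v is not allowed; STRICT is required", [mode])
        }
```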
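And while we're looking at YAML, the deny-by-default authorization policy I mentioned a moment ago is worth sketching too, because it's tiny. An AuthorizationPolicy with an empty spec matches nothing, so once it exists, traffic only flows where an explicit ALLOW policy matches:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: istio-system   # root namespace, so it applies mesh-wide
spec: {}                    # matches nothing: requests are denied unless
                            # some other ALLOW policy matches them
```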
I also just wanted to make a quick note here that, in addition to doing these through admission control, you could take a shift-left approach and do these validations at the CI level. So you have your CI linters that validate the Istio manifests before they're pushed to production, and you have your admission controllers and validating webhooks in Kubernetes as a last line of defense.

Another great tool we can leverage to feature gate and set guardrails for our environment is Kubernetes role-based access control. It's good practice to restrict management of the control plane and the ingress and egress namespaces to cluster administrators. We should also probably limit management of specific, potentially sensitive custom resources like authorization policies, peer authentications, and so on to mesh admins and specific service accounts.

Another way of feature gating our mesh and enforcing desired mesh configurations is through GitOps. GitOps has become an increasingly popular route for managing infrastructure on platforms like Kubernetes, and a lot of Istio users now leverage frameworks like Flux and Flagger to streamline the process of safely deploying and upgrading Istio. But we can also define the configuration of our mesh and our mesh policies declaratively, through a configuration-as-code approach. One major benefit of this is that it prevents configuration drift. Even with strong admission control and RBAC mechanisms in place, it's not feasible to prevent all untracked changes to our environment, right? But with GitOps we have a controller that continuously watches our infrastructure and ensures, through reconciliation, that it matches the desired state we have defined in Git.

So for Istio, we could use GitOps tools to remediate changes to the Istio installation settings and the Kubernetes resources. We could reconcile changes to the mesh config and the global proxy config values that the system administrators defined upon installation. We could extend this to Istio custom resources as well, and we could ensure that the aforementioned admission control policies and the role-based access control we defined as guardrails still exist and match the expected configuration.

To take this example with Istio and a Flux HelmRelease: in my HelmRelease I've defined the configuration for istiod, the installation values and the Kubernetes settings like autoscaling and memory, but I've also set some mesh-wide configurations in the meshConfig field that I want istiod to have upon installation. So when Flux creates the Helm release for istiod, it'll pull the istiod Helm chart from a Helm repository that I've defined elsewhere, pass in these values, and install istiod with these settings.

In terms of using Flux to continuously reconcile changes to Helm resources: native drift detection in Flux is experimental for Helm and thus not suited for production yet, but we could still trigger Flux reconciliations through an external mechanism like a cron job, for instance. With this in place, any undesired changes to the istiod deployment or resources, or to the default mesh config settings, would be reverted back to the initially declared state and configuration we defined in the HelmRelease. And with that, we've eliminated the possibility of configuration drift for these specific specifications.
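Here's a rough sketch of what a HelmRelease like that might look like. The values shown are illustrative stand-ins rather than the exact ones from my setup, and the Helm repository is assumed to be defined separately, as mentioned:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: istiod
  namespace: istio-system
spec:
  interval: 5m                  # how often Flux re-evaluates the release
  chart:
    spec:
      chart: istiod
      sourceRef:
        kind: HelmRepository
        name: istio             # assumed to be defined elsewhere
        namespace: flux-system
  values:
    pilot:
      autoscaleEnabled: true
      resources:
        requests:
          memory: 2Gi           # illustrative sizing
    # Mesh-wide settings istiod should come up with
    meshConfig:
      accessLogFile: /dev/stdout
      outboundTrafficPolicy:
        mode: REGISTRY_ONLY
```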
So the final technique I wanted to discuss here is creating abstractions over the Istio APIs. This is already something that's being widely adopted by companies like Salesforce, Airbnb, and Splunk, just to name a few. With these abstractions, the developers and service owners don't need to worry about learning Istio's CRDs. Instead, they work with higher-level APIs, and there's some CI tooling or automated process in place that converts these to Istio's CRDs. In terms of how this relates to feature gating: if you think about it, you're effectively hiding much of the Istio API from the service owners, and operators are being selective in choosing which particular fields in the APIs to expose to developers, where those fields have been deemed necessary or safe to configure. If you take this example of what Salesforce implements: on the right we have an authorization policy configuration, but this is actually being deployed as part of a Helm chart. The service owners just focus on what you see on the left, which is a values.yaml with a selected set of configuration options for them to configure.

So now that we've seen some ways of feature gating our Istio environment, let's discuss some criteria for deciding which features to whitelist, or which configuration options to allow or expose to developers and service owners. Some of the factors worth considering: operational complexity. What is the status of the feature? Is it suitable for production? What are the implications for security? What impact does the feature have on resource consumption or the overall performance of our system? And is this feature or resource necessary for your specific organizational requirements and use cases?

With respect to operational complexity, there are several worthwhile considerations. How easy is the feature to understand and configure? If we run into issues with it, can we expect adequate support from the community? Is there enough familiarity with this feature in the community to get help with troubleshooting? Is there adequate documentation surrounding this feature on the Istio blog or the docs site? And are there multiple ways of enabling or configuring this feature? If so, I'd recommend trying to restrict it to one way: it's always less confusing to have a single source of truth as opposed to multiple.

A good example of an Istio feature that's widely regarded as complex is the EnvoyFilter. It allows users to customize the Envoy config generated by Istio, say by modifying specific fields or adding filters. And because we're directly modifying the Envoy config, this can be very complex and dangerous: if we misconfigure an EnvoyFilter, we risk destabilizing and compromising the entire mesh. But that hasn't stopped the EnvoyFilter from being widely used. Some popular use cases are, for instance, local rate limiting or running Lua scripts. So the solution here might be to allow EnvoyFilters, but in a more limited capacity. If you need EnvoyFilters for local rate limiting but not for any of the other potential use cases, you could have a more fine-grained validation of EnvoyFilters in your admission control, say by looking for specific filters in the configuration and blocking what you don't need. It's also a good practice to restrict privileges for resources like EnvoyFilters to cluster administrators or mesh experts.
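As a sketch of what that fine-grained allowance might be checking against, here's roughly what a local rate limiting EnvoyFilter looks like; the workload labels and token bucket numbers are hypothetical. An admission policy could permit only the envoy.filters.http.local_ratelimit filter name in patches and reject everything else:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: local-ratelimit        # hypothetical
  namespace: foo
spec:
  workloadSelector:
    labels:
      app: my-app              # hypothetical workload
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
      patch:
        operation: INSERT_BEFORE
        value:
          # The one filter an admission policy might whitelist
          name: envoy.filters.http.local_ratelimit
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 100
              tokens_per_fill: 100
              fill_interval: 60s
            filter_enabled:
              runtime_key: local_rate_limit_enabled
              default_value:
                numerator: 100
                denominator: HUNDRED
            filter_enforced:
              runtime_key: local_rate_limit_enforced
              default_value:
                numerator: 100
                denominator: HUNDRED
```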
Another important factor to consider when restricting features is the status of the feature. Istio designates features as experimental, alpha, beta, or stable, and organizations often have requirements that their production environments only use beta or stable features. One specific resource where this is relevant is the Telemetry API: despite the various observability options and configurations it opens up, it's still in alpha status. So this is an example where we might prefer a more tested route of configuring telemetry, through the mesh config or the global proxy config.

We also want to take into account the potential risks of a feature for the security of the mesh. Obviously, a big reason why people use Istio in the first place is for the security benefits, like mTLS and creating a zero-trust framework. And because security is so important, it might be worth restricting management of the custom resources that govern security to the mesh admins, or only exposing a selected subset of those APIs to developers, right? We've seen a lot of cases where we could have a misconfiguration or a policy mismatch, like the conflicting peer authentications I mentioned before. There are often mismatches between gateways, destination rules, and virtual services that lead to common TLS configuration mistakes, like double encryption or sometimes even no encryption at all. In the example I have here, I have an authorization policy that's overly permissive due to a misconfiguration: just a simple extra dash in front of the from statement. Situations like these are a good case in point for why we should add fine-grained validations over resources like the authorization policy, or perhaps restrict their management altogether to the mesh experts and cluster admins.

Considering the impact on resource consumption and overall system performance is also important. In this example, I have a Sidecar resource that limits the scope of the Envoy config to only the other workloads in the same namespace, and we have it deployed in the istio-system namespace so it applies to all of the workloads in the mesh. This has been shown to significantly mitigate the memory consumption of Envoy. So one validation we could add here is enforcing the existence of such a Sidecar in the root namespace, and ensuring it isn't bypassed on a per-workload or per-namespace basis. We also want to ensure, for instance, that resources can't bypass the designated proxy CPU and memory limits the mesh admin configured upon installation, which could otherwise be done with sidecar annotations like sidecar.istio.io/proxyCPU and sidecar.istio.io/proxyMemory. So we need to watch out for annotations like these in our resource manifests and potentially block them if needed.

Another feature we might need to watch out for in terms of resource consumption is the WasmPlugin resource for generating Wasm-based telemetry through Istio. Because the Wasm binary executes in a VM spun up in each worker thread, this has been shown to significantly increase Envoy's memory consumption in several cases. It's also worth noting that the WasmPlugin is alpha, and Wasm-based telemetry in Istio is experimental, so it might be safer to disallow both in production environments in any case. Still, both of these are very popular, so some use cases might deem it necessary to use them.
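Before moving on, let me sketch the authorization policy mistake from a moment ago, with hypothetical names. The intent is a single rule requiring both the operation and the source; the stray dash in front of from turns it into two independent rules, each far more permissive on its own:

```yaml
# Intended: one rule, allowing GETs only from the bar service account.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-get-from-bar    # hypothetical
  namespace: foo
spec:
  action: ALLOW
  rules:
    - to:
        - operation:
            methods: ["GET"]
      from:
        - source:
            principals: ["cluster.local/ns/foo/sa/bar"]
---
# Misconfigured: the extra dash before "from" creates TWO rules:
# (1) allow GETs from ANY source, and (2) allow ANY request from bar.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-get-from-bar
  namespace: foo
spec:
  action: ALLOW
  rules:
    - to:
        - operation:
            methods: ["GET"]
    - from:
        - source:
            principals: ["cluster.local/ns/foo/sa/bar"]
```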
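And the mesh-wide Sidecar resource from the resource consumption discussion would look roughly like this. Including istio-system in the egress hosts is a common addition so sidecars can still reach mesh infrastructure in that namespace:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default             # applied in the root namespace, so it is
  namespace: istio-system   # the mesh-wide default for every workload
spec:
  egress:
    - hosts:
        - "./*"              # only services in the pod's own namespace
        - "istio-system/*"   # plus the control plane namespace
```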
Finally, when establishing which features to include in your environment, it's obviously very important to consider your specific organizational needs, requirements, and use cases: why did you adopt Istio in the first place, and what is the minimum set of features and configurations needed to accomplish those specific aims? I'd recommend starting small, maybe with some kind of deny-by-default policy in your admission controller, and then, as you build confidence and your use cases become more complex, you can whitelist additional features, APIs, and configurations. For instance, if you're establishing validations around virtual services and destination rules, or creating an API abstraction layer on top of them, maybe start out by exposing just the bare minimum to get traffic working in the first place. Once you have that working, and as you gain additional familiarity with Istio, you can start whitelisting more features and exposing more fields in those APIs.

So to do a quick recap of what we've covered: we've talked about several tools for feature gating and limiting Istio's configurability, and why we might want to do so, and we've discussed some criteria to help you assess which features and settings to allow in your environment. One important takeaway from all this is that none of it eliminates the need to do our homework to navigate Istio safely. Platform engineers still need to take the effort to understand these features, these APIs, and their configurations, evaluate them against the criteria I mentioned, and then decide, at the end of the day, what their allow list or exposed value sets should look like. However, I would argue that the benefits of going through this process, and really putting thought into it, certainly outweigh the costs. At the end of the day, you end up with an untangled and decluttered mesh environment that is more secure and much more manageable to navigate.

And because of this, we have improved harmony between the platform engineers and the developers, which is one of the reasons we chose to use Istio in the first place: now we have a declarative framework and strong policy mechanisms in place for enforcing the desired behavior of our mesh. Platform engineers can worry less about service owners potentially misconfiguring resources or using undesired features or annotations, and developers can prioritize the business logic of their apps instead of dealing with convoluted APIs. And I would argue that the biggest benefit is that when you take this initiative yourself, you can design these constraints specifically around your own needs and use cases.

Finally, I just wanted to touch on some relevant developments in the Istio community that will affect the process of enabling and configuring Istio features down the road. The one I really wanted to highlight here is the ongoing discussion in the Istio community about improving the feature graduation process. Currently there are a lot of features lingering in alpha, but recently there's been more effort to expedite the testing and enhancement process to graduate features to a beta or stable status. For instance, there is currently a proposal outlining the work involved to graduate the Telemetry API to beta.
So as the Istio community graduates more features from experimental or alpha to beta or stable, you can start incorporating them into your production environments with more confidence, and expect to get more support for those features from the community. In general, I also just wanted to highlight that there does seem to be a preference in the community for limiting configurability to one source of truth. This would involve, for instance, moving some configurations from the mesh config and from complicated custom resources like EnvoyFilter into separate, dedicated first-class APIs. And this will make the process of configuring and navigating Istio much, much more manageable in the long run.

So that's it for this presentation. I hope you found it valuable and can take something away from it. Please feel free to reach out and connect on LinkedIn; I'd be happy to answer any other questions or continue this discussion if you have any additional input. So yeah, thank you so much for attending once again, and please enjoy the rest of IstioCon.