I'm going to get started and introduce myself first. Hello, everyone. I'm Yash Gandhi. I work as a cloud platform engineer at Morgan Stanley. What we build are standard developer platforms used across all business units within Morgan Stanley, and I'll take you through how we've built the platform.

The talk is largely focused on three areas: first, the challenge we're trying to solve; then the architecture of how we've built the self-service platform and how we've made it multi-cloud; and then the security aspects of what goes into building such a platform. Finally, taking the platform that's built and the security that's introduced into it, I'll show how application developers go on to actually deploy applications on those clusters. Let's take a look at it step by step.

We've got Alice here, who works as a developer at Morgan Stanley. Let's look at a couple of requirements she has of the platform. All she cares about is that her application gets deployed in a production-grade Kubernetes environment. It would also be useful if she could do all of this herself, without having to depend on an infrastructure team. It would also help if she didn't have to follow different processes for the different cloud providers she has to work with. And over and above that, we are a bank, so it should all follow firm policy and meet all the security requirements enforced on us as a firm.

Let's take a look at how we can solve this challenge. I'm going to walk you through the four parts of how we're solving it. The first part involves building out what we call landing zones. Landing zones are nothing but secure, guardrailed environments in a cloud provider that your Kubernetes clusters get built into. The landing zone provides only the fundamental services like networking and DNS; think of it as your Azure subscription, AWS account, or Google Cloud project. That's just the base foundation your clusters get built into.

Then comes the pattern: the infrastructure-as-code blueprints we've got that provision a base Kubernetes cluster for you. As most of you who've used Kubernetes know, a Kubernetes cluster by itself is never fully sufficient. So as part of the pattern, we also build out and integrate the cloud-provider-specific storage services, key management services, monitoring, and much more.

Above this infra layer is where we deploy the actual Kubernetes platform, which configures your enterprise and open-source components and makes your Kubernetes cluster production-grade and ready, with all security controls and policy compliance. We've also got a common set of install templates that deploys all your platform-level components across multiple cloud providers as part of the platform, and I'll show an example of how we do that.

Now, over and above the base landing zone, the cluster that's built, and the platform deployed on top of it comes your application. As a firm standard, we recommend that all applications are deployed only using GitOps, so that you get a consistent developer experience: you follow the exact same process for any cloud provider. So let's take a first look at the self-service aspect of the workflow.
What we have here on the left is a client crafting the infrastructure configuration once, in JSON format, and we'll take a look at that configuration as well. Once that configuration is created, it's published into a data store. From that point on, the privilege boundary shifts from the client's control to the automation's control: the CI/CD pipeline, which could be GitHub Actions or Jenkins, goes to the data store and pulls down the configuration that was published. On that configuration package, it runs a couple of compliance checks just to make sure no odd configuration is being requested when building the cluster. Once all the checks look okay, the Kubernetes cluster is provisioned in the cloud environment.

Let's look at this in a bit more detail. On the left, I have an example of what the config looks like; there's also a sketch of it below. I define the deployment target, which is exactly where I want to build my cluster. For example, if I'm building into Azure, I just provide the subscription name, the region, and the subnet information. Then there are a couple of properties of the cluster itself: what I want to name it, and the sort of configuration I want for this particular cluster. Like I mentioned, we've also integrated with storage services, so I can provision disks that I'm going to use in my applications as part of the cluster build. At the end, you can also specify any namespaces you want created on the cluster.

But the most important part here is the platform version specified in the config. When customers and clients define this configuration in the repos they maintain, they get to pick a platform version that's recommended by us as the platform team. Every change to this configuration, be it the region, the Kubernetes version, or the platform version, becomes an immutable, versioned artifact that's published into the data store. Customers, like I mentioned, can pick the version of the platform they want to install and configure the Kubernetes cluster with; I'll come back later to why and how that matters. Once a cluster is built on a certain release of the platform, we help them with tools like Dependabot, which sends automated pull requests whenever newer recommended releases of the platform are available. And like I showed on the previous slide, all of this is integrated into their CI/CD workflows.

With that, let's move into the multi-cloud aspects of the platform we've designed. For building out the base Kubernetes cluster from an infrastructure perspective, we've got separate Terraform patterns for each of the cloud providers. Like I mentioned, and like we saw in the config, we've also got cloud-provider-native integrations for the key management, storage, and monitoring services built out with the pattern. Then come the platform components, as I've termed them. The platform components get installed through common templates: we have just one template defined for multiple cloud providers, and we'll take a look at that on the next slide.
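The config itself is only shown on a slide, so as a rough illustration, here is a minimal sketch of what such an infra-config might look like. Every field name and value here is an assumption for the sake of the example, not Morgan Stanley's actual schema:

```json
{
  "deploymentTarget": {
    "cloudProvider": "azure",
    "subscription": "bu-payments-prod",
    "region": "southeastasia",
    "subnet": "snet-aks-01"
  },
  "cluster": {
    "name": "payments-prod-01",
    "kubernetesVersion": "1.29",
    "nodePools": [
      { "name": "default", "vmSize": "Standard_D8s_v5", "count": 3 }
    ]
  },
  "disks": [
    { "name": "app-data", "sizeGiB": 128 }
  ],
  "platformVersion": "2.4.0",
  "namespaces": ["demo"]
}
```

The key property is the platform version: each change to a file like this is published to the data store as a new immutable, versioned artifact.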
Each of these cloud-provider-agnostic templates gets enriched with data about which cluster and which cloud provider is defined in the incoming configuration, and then gets installed through GitOps. So when a Kubernetes cluster is handed off to app developers, it's already built with all of the features and tools I'm going to talk about. A couple of them: OPA Gatekeeper, which we use for enforcing all policy constraints; CSI drivers, which we use to integrate with the storage that's provisioned; external-dns and ingress, which are pretty obvious, just to expose our services; Flux CD, the GitOps controllers we deploy; cert-manager for certificates; and many, many more. The way we think of it is that we don't just give end users a Kubernetes cluster; we give them a Kubernetes cluster that's batteries-included, containing everything required to come and deploy their apps onto.

Let's take a look at one of these templates. For folks who haven't seen a HelmRelease definition before: this is a custom resource from the Flux GitOps controllers that lets you declaratively define how you want a Helm chart installed. What you see on the left is the cloud-provider-agnostic template for installing external-dns on your Kubernetes cluster. Towards the bottom is the only bit that gets cluster-specific and enriched, as you can see on the right side. If you're deploying external-dns on an EKS cluster, you'd obviously specify the AWS account and the Route 53 zone it's going to publish into. If you're provisioning it in Azure, you'd specify the private DNS zone and the subscription it's going to publish into. So on the right, with all the details from the configuration populated in, we get the definition of how external-dns would be installed on an Azure cluster, for example. Each of the components I talked about on the previous slide gets defined as a HelmRelease like this and gets deployed through GitOps onto the clusters.

Now, with all of these platform components installed and ready to use, let's look at the security aspects of what we've built. As a regulated financial enterprise, we're expected to always have information about what security controls exist on the infrastructure that business-critical applications actually operate on. We're expected to have this data always available, not scrambling to get it ready two weeks before an audit. So, as part of the same templates, we also install policy controls on the cluster through OPA Gatekeeper. On the left is a very simple example of how you would restrict running pods from using a non-read-only root filesystem in their container operations; a sketch of such a constraint follows below. Each of these policies is continuously reviewed and updated against the CIS benchmarks as part of crafting a platform release. We have about 20 to 30 such policies, and for each of them we have a set of unit tests that we run and collect as evidence that each policy control actually works as part of the release.
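The slide's policy isn't reproduced in the transcript, but as a minimal sketch: using the K8sPSPReadOnlyRootFilesystem ConstraintTemplate from the open-source gatekeeper-library, a constraint enforcing read-only root filesystems could look like this (the constraint name and excluded namespace are illustrative, not the firm's actual policy):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPReadOnlyRootFilesystem
metadata:
  name: pods-must-use-readonly-rootfs   # illustrative name
spec:
  enforcementAction: deny               # reject violating pods at admission
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system                     # illustrative exclusion
```

Namespace- or scope-level exclusions like the one above are one way the case-by-case policy exemptions mentioned next could be modeled, though the firm's actual exemption mechanism isn't described in the talk.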
So for each of these policies, whether a policy governs how a persistent volume should be created or how a network policy should be created, we have such unit tests written with Gatekeeper, which we collect as part of the platform release. Moreover, if a specific application has a requirement where it needs elevated privileges to perform certain operations, we have the ability to provide exclusions from such policies on a case-by-case basis.

What I'm now going to do is introduce the term "chain of custody". I'm bringing up the flow from the previous slides, now numbered with a couple of pointers, and I'm going to talk about this slide and the next one together. The chain of custody effectively describes which controls exist on each cluster at each stage of the build pipeline.

The first one is the infra-config JSON itself. When customers define this infra-config, they can only pick from a set of supported platform releases they're allowed to use on their cluster, mainly to stay in continuous compliance. And like I mentioned, every configuration change is an auditable record.

Next comes the data store itself. After this configuration is published with a certain version of the platform and the spec of the cluster, it goes into the data store, which is backed by immutable storage. That means once a version is published, saying "this is the version of the platform used for this cluster, and this is the version of the configuration used for this cluster", nobody can go in and tamper with that information.

Then comes the CI/CD pipeline, which takes the platform version as an input, pulls down that release of the platform, and goes and runs your Terraform and your platform component deployments. All the blueprints and manifests of the platform components are bundled up in a release of the platform, which is stored in the data store itself, so even the release is an immutable artifact.

And then the last bit: the checks and controls I mentioned. We've got the platform version defined, we've got the controls that exist per platform version, and we've got unit tests for each of the policies that go into the platform. All of this together helps us answer one question: at any point in time, I know which cluster has which security controls in place, so we can give that information back to the regulators when requested.

With all of these aspects now built in, the platform and the security, let's look at the application side of it. Like I mentioned, we have a standard that all Kubernetes deployments are expected to be done through GitOps. And just like we saw the platform components being deployed, that's exactly how applications are expected to be deployed on the Kubernetes clusters. Every change to an application's HelmRelease manifest also becomes an immutable artifact: it's checked into version control and goes into the same data store. What you see on the left is an example of a demo app that's pulled in from a Helm repository and does nothing but say "hello from Kubrick, Singapore"; a sketch of such a manifest follows below.
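The manifest on the slide isn't reproduced in the transcript, but a minimal sketch of a Flux HelmRelease pulling a demo chart from a Helm repository might look like the following. The repository URL, chart name, and values are hypothetical, and the API versions shown are the current Flux GA ones, which vary by Flux release:

```yaml
# Hypothetical demo app, declared the same way the platform components are.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: demo-charts
  namespace: demo
spec:
  interval: 10m
  url: https://charts.example.com         # hypothetical Helm repository
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: hello-app
  namespace: demo
spec:
  interval: 5m
  chart:
    spec:
      chart: hello-app                    # hypothetical chart name
      version: "1.0.x"
      sourceRef:
        kind: HelmRepository
        name: demo-charts
  values:
    message: "hello from Kubrick, Singapore"  # hypothetical value carrying the demo's greeting
```

Because this is just another versioned manifest, every change to it flows through the same data store and GitOps reconciliation as the platform components themselves.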
Let's take a look at how this actually ends up on the cluster; it's the same workflow. But in this case, instead of the configuration JSON, the clients create the Kubernetes manifests in YAML. They push those to the data store, and as part of the same CI/CD workflows, this time the manifests are pulled down, run through some OPA checks, and the application gets deployed on the cluster. This time, given you've got the platform all bootstrapped, you've got the base Kube cluster, you've got Flux installed as part of the platform components, and on top of that sits your application, which is constantly reconciled and monitored by Flux.

With that, let's look at the last aspect: how do I access this application that I've deployed? One of the other controls that exists for us as a regulated enterprise is that nobody should have full admin or persistent standing access to your Kubernetes clusters. So what we've done is install limited, time-bound access on the cluster, through a Kubernetes CRD, for a given user. All of that is integrated with our entitlement management system, so if the question is "can this user access Secrets in the namespace demo?", and the answer is yes, then time-bound access is installed on the cluster for me as a human, only for the namespace demo, and only to read Secret objects. I won't be able to do anything else on the cluster apart from reading Secrets. Once the time for my access has elapsed, it's automatically revoked as well.

Let's quickly recap what we've actually built with the platform and what we can offer Alice. We've built a production-grade Kubernetes environment from scratch. We've enabled end-to-end self-service for cluster builds as well as application deployments. We've unified the multi-cloud developer experience for Kubernetes deployments: whatever the cloud provider, you don't need to change the workflow anywhere in the process. And all of this while still having all the guardrails in place, and all the security and resiliency expected of the Kubernetes platform, so that you can also comply with regulatory requirements. With that, I think Alice would be happy with the platform we've offered her. And that's about it. Thank you so much. Questions?

Do you need to comply with different kinds of regulators? If so, do you separate the clusters depending on the regulator, or do you use unified, large clusters that comply with all kinds of regulators?

Each of the business units has their own independent clusters, but from a reporting perspective, all the clusters are reported on together. As in: say we've got 100 clusters. What versions of Kube is each of them running, what versions of node images, what version of the platform, and are those the recommended ones they should be on? All of that data is reported together.

And I was curious about your compliance checks, the compliance checks at the last stage of your pipelines, right? Can you give an example of those compliance checks?

Those would effectively run on the Kubernetes manifests that are incoming to be installed on the cluster. With Gatekeeper, similar to how you would write unit tests for code, you can write tests for the Kubernetes manifests that are incoming.
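The talk doesn't name the test tooling, but Gatekeeper ships a gator CLI whose `gator verify` command supports exactly this style of policy unit test. A minimal sketch, assuming the read-only-root-filesystem constraint sketched earlier and purely illustrative file names:

```yaml
# suite.yaml: run with `gator verify .` (all file names are illustrative)
kind: Suite
apiVersion: test.gatekeeper.sh/v1alpha1
metadata:
  name: readonly-rootfs-tests
tests:
  - name: read-only-root-filesystem
    template: template.yaml      # the ConstraintTemplate under test
    constraint: constraint.yaml  # the constraint instance sketched earlier
    cases:
      - name: writable-rootfs-causes-violation
        object: samples/pod-writable-rootfs.yaml
        assertions:
          - violations: yes      # expect this pod to be flagged
      - name: readonly-rootfs-passes
        object: samples/pod-readonly-rootfs.yaml
        assertions:
          - violations: no       # expect no violations for a compliant pod
```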
So you can at any point write assertions and violations: okay, I'm expecting this sort of a pod to cause a violation, I'm expecting this sort of a...

Sorry to cut you off. Do you run those automated checks as per the regulatory requirements?

Yes, yes. Like I mentioned, the policies that are set are as per the CIS benchmarks. We install all of those policies and then constantly keep them updated per the CIS benchmarks.

Thank you.

Hi, great presentation. I had a question on the just-in-time access you were talking about. How do you automate providing that access, and who does the housekeeping of deleting Secret access and other things?

That's integrated with our entitlement management system, the one we have in the firm.

Is that something in-house?

Yeah, that's in-house. And each of the role bindings that gets created is created with a timestamp, so when that timestamp elapses, the role binding is automatically cleaned up on the cluster. We've got a controller that does that.

It's a custom controller that you've built?

Yeah. Okay. Any other questions?

For folks who couldn't hear it, the question was: do we provide self-service, and do people create clusters per application, or is it shared? The answer is that it depends on the use case. Per the business unit requirements, some of them have a shared cluster where they have just a namespace for trying things out and seeing how a new application functions. But beyond the POC stage, folks generally tend to move onto their own clusters, just for the sake of isolation. And then they can scale it up to whatever requirements they have; if it's a small application, just a simple web service, they could have a tiny cluster that serves it. That's mainly about it.

I have a question. You mentioned that customers are allowed to operate and maintain their cluster configuration. Have things ever gone wild, where you end up with all these different clusters on different versions? Or is that something you enforce with your compliance checking?

Like I mentioned, they can only pick from a certain set of platform versions they're allowed to use. We don't really enforce controls on what sort of nodes they want to use; if someone's requesting, say, a 32 GB node and they want ten of those, we don't really go into those details. But if they're using unsupported versions of Kubernetes or unsupported versions of our platform, that's when we block the build, at PR time or at build time itself: hey, this particular combination isn't one that's supposed to work. For example, say I want to build a cluster today with 1.24 or 1.25 against the latest release of the platform I've built out today: we always maintain this matrix of which platform version supports which versions of Kube itself. That basically helps us catch such misconfigurations before they go too far.

And who upgrades the cluster?

That's up to the application developers themselves.

We have five minutes. If there are no other questions, then thank you so much.