Thank you everyone for coming to this session. Good morning, good afternoon, good evening, depending on where you're logged in from. And it's so good to be back at KubeCon in person. The last one I attended in person was in San Diego in 2019. How many of you were there in San Diego? Wow, this is great. Good to see a repeat audience. And how many of you are aware of the Argo and Crossplane projects? That's great. That's a really good note to start on. So let's get started. My name is Vikram and I'm a senior architect at Adobe. I work in the developer platforms organization, and these days I'm working on introducing advanced capabilities into our internal developer platform. These capabilities include GitOps-based CI/CD, GitOps-based infrastructure provisioning, and building an AIOps foundation for our internal developer platform. With me, I have Manabu from AWS. Manabu? Hey guys, my name is Manabu McCloskey. I'm a solutions architect for AWS, and I focus on open source technologies, especially infrastructure tooling. Excited to be here. Cool, thanks Manabu. So today we're going to talk about how Adobe enabled GitOps-based infrastructure provisioning and application rollouts in a secure and multi-tenant way. Let's get started. We have a really packed agenda today. We're going to start by talking about the pain points, needs, and requirements. Thereafter, we'll have a 10,000-foot overview of Adobe's services landscape, do a quick overview of the internal developer platform, and then dive right into deployments using the Argo projects and infrastructure provisioning using Crossplane and Argo. Thereafter, Manabu is going to walk us through the multi-tenancy and security requirements and how we solved some of them with the existing tooling that we had.
And finally, to wrap it up, we're going to compare the developer experience between the previous and the new states and talk a little bit about the challenges and the unknowns. Okay, let's get started. Let's quickly look at the previous state and the pain points associated with it. If you're a service developer, you'll be able to resonate with a lot of these. We have been talking to a lot of service developers as part of our outreach to clients, and we realized that each team is doing infrastructure provisioning in its own way. They do need infrastructure provisioning, but the platform does not help them. They have custom tooling, a steep learning curve around infrastructure provisioning, and they track their infrastructure and compute resources separately. And if you're part of a platform team, you'll be able to relate to some of these: the platform team does not have visibility, observability, or auditability into the infrastructure resources being provisioned, and we have a really hard time troubleshooting the issues that teams face. When they run into problems, one of the first questions is: okay, what kind of infrastructure provisioning are you doing? How are you connecting to these resources? And it becomes really hard for us. And it wasn't just that we were encountering these problems in real life; we also ran a quick survey with the teams to figure out whether we were actually thinking about this in the right way, and the results were really overwhelming.
As you can see on the screen, around 90% of the developers were asking for a templatized solution from the platform team, and more than two thirds of them wanted the infrastructure provisioning solution to be integrated with the existing GitOps workflows, and they wanted it as soon as possible. So there was a customer need and we decided to fill it. Based on those discussions, we came up with a list of requirements for the solution that we're going to talk about today. The requirements included standardization at the Adobe level, of course. We also wanted the solution to be Kubernetes native, because the platform itself is based on Kubernetes, and the new solutions that we're building are all based on GitOps, so we wanted to make sure this newer solution is also GitOps friendly. Multi-tenancy and security, as we talked about, are key requirements for our platform and, in general, the focus of this talk as well, so we're going to talk about these requirements in detail. Also, we are running on top of AWS, Azure, and our own data centers, so multicloud is definitely a key requirement for our platform. And as much as we'd like, as a platform team, to work on everything and create all the solutions, we know that our resources are limited. So we want to create these solutions in an extensible way, so that once we build the foundation, the individual service teams, or the community that we have, can contribute to the solutions and extend them. And finally, we want industry alignment. We don't want to be creating solutions in silos; we want to contribute back to the community and learn from the community as well. So with that, let's have a quick look at the Adobe services landscape, the 10,000-foot overview.
So at a very high level, we have these three clouds, as most of you may know: Document Cloud, Creative Cloud, and Experience Cloud. Needless to say, these are composed of various Adobe products and services. A lot of you would know about Photoshop, Illustrator, InDesign, Acrobat, Lightroom, and the many other products in our portfolio. And these products use a number of platforms underneath, like the Content Platform, Data Platform, and the AI and ML Platform that we have built over the years. What these platforms have in common is Adobe's internal developer platform that they run on top of. So this is the foundation layer that everything runs on top of, including the products and services directly, and also the platforms. And in turn, our internal developer platform runs on top of AWS, Azure, and Adobe's own data centers, as we talked about. So let's do a quick overview of the developer platform itself. What you see here is a lot of boxes, I know, but these are the capabilities of the internal developer platform. A lot of these capabilities are not really specific to Adobe at all; if you read through them, you'll be able to relate to all of them, because they're independent of any Adobe-specific stuff. But the key thing here is that they're divided into three different development phases, with a certain color coding going on. In yellow, we have the discover and create phase, with capabilities that help teams get started on the platform. These are mostly day-zero activities. What you see in green is the integrate and deploy development phase.
And this helps teams get their applications up and running on top of the platform, so you can think of these as the day-one activities. And finally, we have the capabilities in blue, which are related to operations and improvements. These help teams with the management and maintenance of their applications on the platform, and you can think of these as day-two activities. The focus for today is going to be these three capabilities: infrastructure provisioning, delivery and deployment, and workflow orchestration. So let's see how we're going to achieve that. Let's talk about how we have organized our infrastructure in a hub-and-spoke kind of model before we go into the architecture of the solutions themselves. At the center, we have a hub cluster. The hub cluster is where we have the tenant hub namespaces; there's one tenant hub namespace allocated per tenant, and this is where we run our CD, our events, and our workflows. Alongside that, we have Crossplane installed on the hub cluster. As far as the spokes are concerned, we have a fleet of multi-tenant remote clusters that the hub is connected to, and we have the tenant remote namespaces on these clusters. So one tenant can have multiple tenant remote namespaces, as you can imagine, depending on which clusters, and how many clusters, they want to deploy to. We also have Argo Rollouts installed on the remote clusters in order to facilitate the advanced deployment capabilities. So alongside connecting to the remote clusters, the hub cluster is also connected to the corporate GitHub, where all the code for the various applications lives. The hub cluster is essentially choreographing the deployments between Git and the remote clusters.
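To make the Argo Rollouts piece concrete, here is a minimal canary Rollout of the kind that could run on a remote cluster. This is an illustrative sketch, not Adobe's actual configuration; the names, image, and canary steps are assumptions.

```yaml
# Hypothetical canary Rollout on a tenant remote namespace (illustrative values).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: tenant-a-service
  namespace: tenant-a-remote
spec:
  replicas: 3
  selector:
    matchLabels: {app: tenant-a-service}
  template:
    metadata:
      labels: {app: tenant-a-service}
    spec:
      containers:
        - name: app
          image: example/tenant-a-service:v1
  strategy:
    canary:
      steps:
        - setWeight: 20          # shift 20% of traffic to the new version
        - pause: {duration: 5m}  # observe before promoting
        - setWeight: 100
```

The Rollout replaces a plain Deployment, which is what lets the platform offer progressive delivery on the spokes without tenants changing their GitOps flow.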
Besides these, we also have the tenant-owned cloud accounts, which could be an AWS account or a tenant Azure subscription for that matter. And we provide a way for the clients running on the remote clusters to connect to the resources that are provisioned in those cloud accounts. So with that overview of the hub-and-spoke model, let's look at how we are solving the deployment-related problems using the Argo projects. We start where we left off in the previous slide. We have the corporate GitHub, the hub cluster, the remote clusters, and the AWS account. One thing to note in this diagram specifically is that, on the top right, we have resources which are pre-provisioned in the tenant-owned AWS account. We're using an AWS account as a reference, but you can imagine that it could just as well be an Azure subscription for all intents and purposes. In the corporate GitHub, we have the client repos. These include the app code, the Kubernetes manifests or Helm charts, and also the Argo manifests. Then we have a shared workflows and events repository, which contains shared Argo manifests that can be referenced by any other Argo manifest. For example, the Argo manifests in the client repos refer to the shared workflows and events in order to promote code reuse. In the hub cluster, we have something called a provisioner, or provisioning workflow. This is common to all applications running on the hub cluster; it works as a kind of admin. The job of this provisioner is to provision the resources required in the hub cluster, which includes the tenant hub namespace, the deployment workflow and events in the tenant hub namespace, and multiple Argo CD applications that map the folders inside the client repos to, for example, the tenant remote namespace.
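An Argo CD Application that maps a client repo folder to a tenant remote namespace, of the kind the provisioner creates, might look like this. The repo URL, folder names, and cluster address are illustrative assumptions, not Adobe's actual values.

```yaml
# Hypothetical Argo CD Application created by the provisioner (illustrative values).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-a-app
  namespace: tenant-a-hub          # the tenant's hub namespace
spec:
  project: tenant-a
  source:
    repoURL: https://github.example.com/tenant-a/service.git
    targetRevision: main
    path: gitops-manifests         # folder the deployment workflow writes to
  destination:
    server: https://remote-cluster-1.example.com   # a spoke cluster
    namespace: tenant-a-remote     # the tenant's remote namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

With automated sync, any commit into the mapped folder is applied to the tenant remote namespace without further intervention.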
So with that, let's look at how the end-to-end workflow works. We start with any changes in the app code. They trigger the provisioning workflow and events; the provisioning workflow kicks in and does its provisioning. As we talked about, if needed it will create more Argo CD applications, and it will also modify anything needed in the hub namespace. Thereafter, it will invoke the deployment workflow. Once the deployment workflow is invoked, and of course the deployment workflow is based on Argo Workflows, it is a series of steps. The first steps might build the images and scan the images, and thereafter it writes to the GitOps manifest folder inside the Git repo, updating the GitOps manifest folder with the Git SHA corresponding to the deployment. Now, one of the Argo CD applications that was created by the provisioner is listening for any changes in the GitOps manifest folder; it gets triggered as a result and applies the Kubernetes manifests to the tenant remote namespace. In the tenant remote namespace, the Kubernetes resources come up, as you can imagine, and they are able to access the resources in the tenant-owned AWS account. Also worth noting is that the Argo manifest folder is mapped via Argo CD to the tenant hub namespace. What this means is that any changes to the Argo manifest folder are applied to the tenant hub namespace. For example, say you want to modify the deployment workflow itself: you have five steps and you want to add two or more steps. You modify the Argo manifest folder, commit your changes there, the changes are applied to the tenant hub namespace, and on the next deployment the modified workflow kicks in.
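The build, scan, and Git-write steps described above could be sketched as an Argo WorkflowTemplate like the following. The step names, images, and scripts are all hypothetical placeholders; the point is the shape of a sequential workflow whose last step commits manifests pinned to the new Git SHA.

```yaml
# Sketch of a deployment workflow as an Argo WorkflowTemplate (assumed names/images).
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: deployment-workflow
  namespace: tenant-a-hub
spec:
  entrypoint: deploy
  templates:
    - name: deploy
      steps:
        - - name: build-image
            template: build
        - - name: scan-image
            template: scan
        - - name: update-gitops-folder   # commits manifests referencing the new SHA;
            template: git-commit         # Argo CD then picks up and syncs the change
    - name: build
      container: {image: example/builder:latest, command: [build.sh]}
    - name: scan
      container: {image: example/scanner:latest, command: [scan.sh]}
    - name: git-commit
      container: {image: example/git-writer:latest, command: [commit-manifests.sh]}
```

Because the workflow itself lives in the Argo manifest folder, extending it is just another Git commit, as described above.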
So everything seems to be working great here, right? So what is the issue? The issue is that the resources are pre-provisioned in the AWS account. We don't want these resources to be pre-provisioned, or provisioned in a different way; we want this to be integrated with the overall workflow. That's why we're here. So let's see how we can solve this problem using Crossplane. But before we do, let's quickly revise some of the Crossplane-related concepts, because they can become a little confusing as we go into the next slide. Here we have two distinct sections. On the left side are the tenant and application concerns, and on the right side we have the platform concerns, as you can see. Everything starts with the concept of a composite resource in the Crossplane world, but it is an abstract concept. How is it defined? It is defined by something called a composite resource definition, or XRD for short, and it's a cluster-scoped resource, as mentioned here. The platform team defines this composite resource definition for the clients so that they can refer to it. The platform team also defines one or more compositions. Right now we're showing a composition for AWS, but there could be a composition for Azure or a composition for GCP as well. What the composition really does is compose multiple managed resources from the cloud provider; in this case, AWS is the cloud provider. It composes the resources, and that's how Crossplane knows which resources need to come up when the composite resource is invoked or instantiated. The platform team also defines something called a provider config.
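For reference, a minimal XRD looks like this. The group and kind names here are hypothetical, chosen only to show the relationship between the cluster-scoped composite resource and the namespaced claim it exposes.

```yaml
# Minimal sketch of a CompositeResourceDefinition (XRD); names are illustrative.
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xdatabases.platform.example.com
spec:
  group: platform.example.com
  names:
    kind: XDatabase            # cluster-scoped composite resource
    plural: xdatabases
  claimNames:
    kind: Database             # namespaced claim the tenant app authors
    plural: databases
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                storageGB: {type: integer}
```

Defining `claimNames` is what lets tenants stay namespace-scoped while the platform team keeps the cluster-scoped machinery on their side of the boundary.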
What the provider config really does is provide access to the tenant AWS account. The platform team will take the credentials from the client teams, create a provider config for them, and attach it to the service. That's how Crossplane figures out which AWS account it needs to provision the resources into. There's also the concept of a composite resource claim, as you can see here, which is namespace-scoped. This is an important point, because the tenant app only uses the composite resource claim, as you can see on screen; everything to the right is abstracted away from it. Everything on the right, in the platform team's concerns, is totally up to the platform team to define; what the tenant team does is author the composite resource claims. As part of authoring a composite resource claim, they specify which composition to use and which provider config to use. This is how Crossplane figures it out: okay, this is the composition attached to this composite resource claim, so let's make sure the resources attached to that composition are brought up; and this is the provider config, so this is the AWS account where I need to provision the resources. And once the resources come up, the tenant app is able to access them. So with that overview and refresher on the Crossplane concepts, let's look at the actual workflow using Crossplane and Argo. We start off where we left off in the Argo-related workflow. We have the corporate GitHub with the various folders, and we have the hub cluster with the provisioning workflow and events, the deployment workflow, and the various Argo CD applications. We still have the remote clusters, where Argo Rollouts and the tenant remote namespaces are.
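A claim as a tenant might author it could look like the sketch below. The kind and field names are hypothetical, following the XRD concept just described; the claim picks a composition, while the provider config binding can be handled by the composition on the platform side.

```yaml
# Hypothetical namespaced claim authored by a tenant (illustrative names/values).
apiVersion: platform.example.com/v1alpha1
kind: Database
metadata:
  name: orders-db
  namespace: tenant-a-remote     # claims are namespace-scoped
spec:
  compositionRef:
    name: database-aws           # which composition to use (AWS vs Azure, etc.)
  storageGB: 20                  # schema field defined by the XRD
```

Everything to the right of this claim, compositions, managed resources, provider configs, stays invisible to the tenant.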
But one thing to note here is that the tenant AWS account does not have any pre-provisioned resources, as you can see. So what is the difference with the Crossplane-related functionality that we're adding? We now have a crossplane manifest folder in the client repository, which contains the composite resource claims we were talking about in the previous diagram. That is what the client tenants use and what they add to this crossplane manifest folder. There is also a shared infra resources repository, as you can see at the top left, which is a repo containing the composite resource definitions and the compositions published by the platform team for the popular composite resources in use in the organization. And in the hub cluster itself, we have the Crossplane installation. Right now we're only showing the Azure provider and the AWS provider attached to it, but as you can imagine, Crossplane supports many providers, and those can be attached as well. So with that, let's quickly look at how the workflow changes to be able to do the infrastructure provisioning as well. Again, the workflow starts with any changes to the app code. They trigger the provisioning workflow, which does its thing: the provisioning related to the namespaces, the deployment workflows, and all of that. But alongside that, there is an additional job it needs to do now, which is to look at which composite resources are being requested by this application, and if those are not already available on the hub cluster, it installs them there. So we are, in a way, dynamically adding those resources to the hub cluster, rather than having all of the resources there from the word go. This helps keep the number of CRDs on the hub cluster to the bare minimum.
So once that happens, the provisioning workflow again invokes the deployment workflow, and the rest of the workflow is pretty straightforward; we talked about it in the previous Argo-related slide. The Kubernetes resources come up in the tenant remote namespace and everything works as expected. We also talked about how the Argo manifest folder is mapped to the tenant hub namespace, so any changes to the Argo folder are applied to the tenant hub namespace, for example when the deployment workflow needs to be modified. But the additional thing happening here is that the crossplane folder is also mapped to the tenant hub namespace. What this means is that any time a claim is added to the crossplane folder, it gets applied to the tenant hub namespace, and from there Crossplane picks it up. It says: I can process these resources, let me make sense of this claim and figure out what needs to be done. It looks into the claim, figures out the AWS account where it needs to provision and the resources that need to be brought up, and does the provisioning. And once the resources are up, the tenant containers are able to access them, and party time, everything is great. So with that, this is how the end-to-end workflow works. But there are also multi-tenancy and security requirements that we need to go through, and at a very high level we have divided them into four categories. Let's look at these categories. One, legitimate access: a person should have access, at all times, only to the resources they are supposed to have access to. Second, there should be namespace isolation. Our platform in general does not allow tenant teams to create cluster-scoped resources themselves.
So the client should be using the claims in a namespace-scoped fashion. That's something we are trying to enforce. Third, any bad actors, as you can imagine, need to be stopped from exploiting the system. And finally, performance: the existing and the newer workflows need to be performant. Some of you might know that Crossplane adds a lot of CRDs to the cluster, and that exposes some Kubernetes performance-related problems. We need to make sure those performance problems do not impact any of the clients or any of their workflows. So with that, I'll hand it over to Manabu to walk us through the various multi-tenancy and security requirements one by one and go deep into them. All right. Thanks, Vikram. So let's go into all the details about these multi-tenancy and security requirements. Before we do, just for the rest of the talk, we're going to say that a provider config is equal to an AWS account. Now, strictly speaking, that's not true; it's more of an AWS IAM role than an account, but for the sake of simplicity, I'm just going to say it's an account. Okay. The first requirement is about legitimate access. Tenants should be able to access whatever resources they create, and in their own accounts only. So whenever we give access to another entity in AWS, we should always limit what it can do. In this case, we're using AWS roles, policies, and all that to minimize the permissions that Crossplane needs. You don't want to give Crossplane permission to create an AWS IAM user or anything like that, so you should always limit that. And also, in Adobe's IDP, we generate this external ID string.
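A provider config that assumes a role in the tenant account using an external ID might be sketched as below. This is a hedged illustration: the `assumeRole` block follows recent versions of the Crossplane AWS provider (field names can differ between provider versions), and the account, role, and external ID values are placeholders.

```yaml
# Sketch of a ProviderConfig assuming a tenant-account role with an external ID.
# Field layout assumed from recent provider-aws versions; values are placeholders.
apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: tenant-a
spec:
  credentials:
    source: InjectedIdentity     # e.g. pod identity on the hub cluster
  assumeRole:
    roleARN: arn:aws:iam::111122223333:role/crossplane-provisioner
    externalID: "generated-by-adobe-idp"   # must match the role's trust policy
```

The tenant puts the matching external ID condition in the IAM role's trust policy, so an assume-role call without the right string is denied.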
This is a string that gets passed to the tenants, and the tenants specify it in the IAM role, making sure that whenever something wants to assume this role, the external ID string must match; otherwise, the request is denied. In addition, we also leverage Argo CD's RBAC to prevent tenants from seeing other tenants' resources within the Argo CD UI as well. All right, the next requirement is namespace isolation. Like Vikram was mentioning, these provider configs need to be mapped to their namespaces. What I mean by that is that someone in namespace 2 should have access to provider config 2, but if they try to use provider config 1, that should not happen, right? Because these are different accounts. There's a problem, though: provider configs are actually cluster-scoped, so it becomes a little difficult to confine them to a namespace. What that means is that if someone from namespace 2 specifies provider config 1 in their manifest, Crossplane is just going to use that provider config for them, even though they're not supposed to, because there's no concept within Crossplane of matching a provider config to a namespace. How do we solve that? We use the composition process. A composition, like Vikram mentioned, is pretty much just a template that glues different managed resources together. And when you're gluing all these managed resources together, you can modify certain fields; actually, you can modify any field in these managed resources. So if you look at the YAML file at the bottom, we are saying: take the name of the namespace, apply some transformation to it, and use that as the name of the provider config. So if someone comes in and says, hey, I want to use provider config 1, the composition process is going to say: no, that's not where you're supposed to go.
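The patch described here can be sketched as follows. The label path is real Crossplane behavior (the composite resource carries a `crossplane.io/claim-namespace` label), but the composition name, managed resource, and naming format are illustrative assumptions.

```yaml
# Illustrative composition patch: derive the providerConfigRef from the claim's
# namespace, overriding whatever the tenant asked for. Names/format are assumed.
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: database-aws
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1alpha1
    kind: XDatabase
  resources:
    - name: rds-instance
      base:
        apiVersion: database.aws.crossplane.io/v1beta1
        kind: RDSInstance
        spec:
          forProvider:
            engine: postgres
      patches:
        - type: FromCompositeFieldPath
          # Crossplane labels the composite with its claim's namespace
          fromFieldPath: metadata.labels[crossplane.io/claim-namespace]
          toFieldPath: spec.providerConfigRef.name
          transforms:
            - type: string
              string:
                fmt: "providerconfig-%s"
```

Because the tenant cannot influence this patch, a claim from namespace 2 always lands on provider config 2, regardless of what the manifest requests.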
So it's just going to use provider config 2 instead of provider config 1. Now, there's another problem with namespace isolation. Tenants should be able to use any managed resources, right? Managed resources, in AWS terms, correspond to AWS services, like S3 buckets, DynamoDB tables, et cetera. So tenants should be able to use any of that. Again, the problem is that these are cluster-scoped resources, and like Vikram mentioned earlier, tenants are not supposed to be able to provision cluster-scoped resources. So there has to be a way to prevent users from creating cluster-scoped resources, but at the same time, all of this should still be available. To solve that, Crossplane has this mechanism of composite resources and composite resource claims. Claims are namespaced; this is where applications use them. The claim points at the composite resource it corresponds to, and the composite resource manages all the other things that clients actually use. Now, this works, but the problem is that there are a lot of resources in AWS. If you look at provider-aws, there are something like 300 or 400 CRDs. Creating a composite resource and an XRD for each of them takes a while, so we needed a way to automate it. Thankfully, we actually didn't have to build this ourselves, thanks to Christopher Haar. He is a maintainer of provider-aws and works for a company called Deutsche Kreditbank. They are using Crossplane in their environments, and they ran into a very similar problem to ours. They were generous enough to open source this tool and show us how to use it. What this tool does is go over all these CRDs and generate a composition and an XRD for each of them.
When you are generating those compositions and XRDs, you can also specify certain requirements, say: you have to use this KMS key, you have to use this tag, you have to have this label, all that stuff. So you can use that to generate the compositions, XRDs, everything, and then check them into Git. Now, we don't want all of them to be available all the time, because we'd effectively be doubling the number of CRDs installed in the cluster. So, like Vikram mentioned, there is an Argo workflow to do exactly that: you have an application repository that triggers the Argo workflow, and if certain XRDs or kinds are missing, it reaches out to the repository and applies them to Crossplane. From there on, the workflow is the same. So after all that, this is what the tenant's workflow looks like: all they have to do is say, hey, I want an RDS cluster, and use a claim for that. The claim is going to create the composite resource, and the composite resource is going to create the Amazon RDS cluster in this case. Okay. Now, the next category is blocking exploits. Even though tenants are not supposed to be able to create cluster-scoped resources, it might be possible for them to exploit some sort of weakness and use a managed resource manifest directly instead of going through the composition process, which, again, is not going to work if you're just using a managed resource like this. So how do we solve this? We use OPA, the Open Policy Agent, with Gatekeeper. If a request comes in like this, we just say it's not allowed. Here is an example policy you can use: we're saying that if it's not part of a composition, we deny you.
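In the spirit of the policy on the slide, a Gatekeeper ConstraintTemplate for this check might look like the sketch below. It relies on the fact that managed resources stamped out by a composition carry a `crossplane.io/composite` label; the template name and exact Rego are illustrative assumptions, not the policy Adobe actually runs.

```yaml
# Hedged sketch: deny managed resources not created through a composition.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requirecomposition
spec:
  crd:
    spec:
      names:
        kind: RequireComposition
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requirecomposition

        violation[{"msg": msg}] {
          # resources composed by Crossplane are labeled with their composite's name
          not input.review.object.metadata.labels["crossplane.io/composite"]
          msg := "managed resources must be created through a platform composition"
        }
```

A companion constraint would then scope this template to the Crossplane managed resource kinds, and a second check could verify the composition is one of the Adobe-provided ones.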
If the composition is not one of the Adobe-provided compositions, we also deny you. And then the final category is high performance. I don't know how many of you have used Crossplane in your environments, but you install Crossplane, then install a couple of providers, and you might get something like this: you kubectl get something and then have to wait six or eight seconds for it to come back. That kind of problem is due to the large number of CRDs installed in the cluster. In our case, we only have about 1,000 CRDs, but that's already enough to cause this kind of problem. So there has to be a way to avoid these throttling issues; if you're trying to help your tenants use this workflow, you don't want their workflow to slow down because of these kinds of problems. How do we solve this? Luckily, we actually didn't have to do anything. Upbound is the company behind Crossplane, and they have been working very hard with the Kubernetes community to solve the problems associated with large numbers of CRDs. Thanks to their effort, a lot of the client-side issues were resolved starting from Kubernetes 1.24, and some of these changes were also backported into 1.22 and 1.23. The server-side issues were resolved starting with 1.25, and some of those were also backported into 1.23 and 1.24. I'm not going to go into detail on the specific issues, because I didn't solve them myself, but there is Nic Cope's blog post. Nic Cope is a principal engineer at Upbound, and he wrote an excellent blog post about this issue, so I highly recommend you go check it out. With that, back to you. Thanks, Manabu. Thanks for going into all these multi-tenancy and security requirements in detail. With that, let's try to compare the previous and the newer developer experience.
We talked about the problems that the service teams and the platform teams had. In the new world, the service teams are really happy, because they are able to use GitOps for both their compute and infrastructure deployments, and they can specify and track the provisioned infrastructure resources in a Kubernetes-native way. As far as the platform team is concerned, we are even happier, because we can define these blessed composite resources in consultation with the security teams and roll them out to the various service teams, and we have improved auditability and observability of the infrastructure resources being provisioned. And with an improved understanding of the service architectures, we are able to reduce the mean time to resolution when we encounter issues or outages. Of course, no new solution comes without its challenges and unknowns, and we have our own set that we are looking into. The first one is the hub cluster and the Kubernetes performance issues that Manabu mentioned. There's one hub cluster right now, and we're trying to figure out and test how much it can scale with Argo, Crossplane, and the multiple CRDs. We're doing that testing right now, trying to determine whether it will scale to the thousands of services we're talking about at Adobe. The second thing is that we've primarily been investigating the AWS space, and Azure is something we're starting to investigate now, in terms of how much support it has from a Crossplane perspective. We have been able to solve a lot of problems in consultation with the AWS team as far as the AWS provider is concerned, but Azure is still a bit of an unknown.
Finally, the next one is around technology maturity. Argo and Crossplane are incubating projects in CNCF, but there's still a road ahead for them to graduate. We keep encountering one problem or another here and there, so we're working with the community to see what kinds of issues we have and to resolve them along the way. At Adobe, of course, as you can imagine, everybody is doing infrastructure provisioning already, but in their own way, so there's a lot of tooling inertia. A lot of teams are using Terraform, for example, so we are working with them on how Crossplane and Terraform can work together, what the path forward is for them, and how to provide an easy migration path. We don't want them to jump and land in a river full of crocodiles, as you can imagine. And finally, we need community support. As much as we'd like to contribute back to the community, we also need to make sure the community is supporting us in our efforts going forward, so we're keeping a tab on that and working actively with the community. With that, Manabu and I would like to thank you for taking the time. There's a QR code if you want to provide feedback. Thank you.