Hello everyone, let's get started with the next session. I think we had a very good introduction to multi-tenancy from Mohan. As for myself, this is Vamsi Krishnasamudrala, principal architect and enterprise architect for cloud engineering platforms at American Airlines.

Okay, so let's get started. How many of you had coffee this morning? Okay, so Shravan, a quick question: what do you call a pod that runs on coffee? Yeah, it's a caffeinated container.

So let's get started. You can see a small logo; that's KPaaS, which stands for Kubernetes Platform as a Service. That's our internal branding for the multi-tenant clusters we host at American Airlines. Our topic is navigating the multi-tenancy maze and avoiding the anti-pattern pitfalls. What we'll be discussing today: the multi-tenancy maze and understanding the challenges of multi-tenancy; common anti-patterns we run into while setting up multi-tenancy implementations; some pitfalls to avoid; deep dives into the anti-patterns of lack of isolation and limited scaling; a deep dive into lifecycle management of clusters; how to handle inadequate resource limits; some success stories from setting up multi-tenancy at American; some key takeaways; and, at the end, Q&A.

Okay, I think Mohan already introduced this slide in his talk. This is the European railway system, and it can be compared to a multi-tenant cluster. I'm from the US, and it's completely different from what we use; we use cars a lot, and here you have the rail system. In the context of Kubernetes, just as the European rail system effectively manages and standardizes the movement of people and goods between various destinations, multi-tenancy in Kubernetes is the kind of concept that enables resource utilization, enhances scalability, and ensures secure operations within the cluster. When I landed in Europe, I thought, okay, this is the best comparison we can relate to multi-tenancy.

So let's pause here. This is a mind-map diagram of the Kubernetes maze. Look at how many dimensions you have to maintain in a Kubernetes multi-tenant cluster: there's automation, troubleshooting, observability, networking, security, storage, and some miscellaneous concepts you can see, like the API gateway and disaster recovery. Apart from that, if you're not using cloud hosting, there's the control plane to take care of, plus the deployment itself, whether it's on-prem or in the cloud, and the kinds of nodes. And you can see how it branches out like a maze. How many of you still remember solving the maze in newspaper cuttings, tracing a path from the source to the destination? But here, every branch in the maze affects every other branch. We need to handle this maze properly; if you don't design it properly, it can affect the whole central system.

So let's deep dive a bit into some common anti-patterns we see when implementing Kubernetes multi-tenancy. Yes, we heard Mohan talk about this from the Microsoft side this morning, and we just talked about network isolation.
For us, we know the namespace is the logical isolation of resources in a multi-tenant cluster. But if we just create the namespace, are we secure? Are the tenants secure? No; one tenant can still consume the whole cluster's resources. Next come resource quotas: unless and until we define resource quotas and limits for each namespace and pod, tenants can eat into each other's resources. So once we set up resource quotas and limits, we think we're good. Then comes the default networking, where any pod can talk to any pod inside the cluster. That is where network isolation comes into the picture, and that is where the service mesh is something we implemented. But even if we implement all of that, there are other branches, as we saw in the maze: DNS, for example. Someone can abuse CoreDNS; one application can take CoreDNS down completely. Like this, there are multiple aspects and multiple dimensions to each area when we deep dive into it.

So there are three anti-patterns we listed. One is limited scalability: when you design your applications for multi-tenancy but you don't estimate the growth of the clusters and you don't know how you're going to host the tenants, that's an anti-pattern. You need to design for dynamic scaling rather than static scaling. The second is overuse of global configuration: an architecture that relies heavily on global configuration limits customization for the tenants, and one action by one tenant can completely disrupt the other tenants. That's another anti-pattern. The third is hard-coded dependencies, where we don't take individual tenant needs into account: things like data privacy rules, localization, regulations like GDPR, or PCI for payment systems. Different kinds of tenants have different kinds of needs, so we should not hard-code something for one tenancy; we need to design in a way that keeps things flexible. That's the last of the common anti-patterns.

Coming to pitfalls: what are some common pitfalls in implementing multi-tenancy? Operational complexity: we have seen how many things go into setting up tenancies, and there are a lot of complexities we need to design for, so keep in mind that we should not over-complicate the setup itself. Inadequate isolation: as we discussed, whether the isolation is at the level of logical grouping, networking, or anything else, every layer needs to be secured. Ignoring security risks: you need to take security into account, whether that's vulnerability management or scanning across your clusters, pods, and runtimes; you should not ignore security risks for an application, because you're setting up for multiple tenants and multiple use cases. Improper dependency management. And resource contention: take it as an overbooked hotel. A hotel can only take X number of customers; if you overbook it, you cannot accommodate them. Those contentions will always be there, so we need to design properly and plan ahead.
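To make "adequate isolation" concrete, here is a minimal sketch of the first two layers described above: a per-namespace quota plus a default-deny network policy, assuming a hypothetical tenant namespace called `tenant-a`. The numbers are illustrative, not our production values.

```yaml
# Per-tenant quota: caps the total CPU, memory, and pod count that all
# workloads in the hypothetical tenant-a namespace can claim.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "50"
---
# Default-deny ingress: pods in tenant-a accept traffic only from their own
# namespace, instead of the flat "any pod can talk to any pod" default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
```

A side effect worth knowing: once a compute ResourceQuota exists, pods in that namespace must declare requests and limits (or inherit them from a LimitRange), which also forces tenants to state what they actually need.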
The last pitfall is lack of cost management. When there are multiple teams on the cluster, you need to plan accordingly for chargeback and showback: how are we meeting the needs of the tenants, and how are we showing the cost back to them? There should be mechanisms to show these things.

So those are the common pitfalls, but there are always ways to address them, and they fall into two different buckets: either a tool or a best practice. Use the right tool for the right job, or apply a best practice. As I always discuss with Shravan, best practices are not like butter that you spread on bread; they need to grow like a garden. It takes time, but you need to work these best practices into your architectures.

The tools may be resource limits and quotas, as we discussed; network policies; RBAC controls; HPA, or KEDA if you have event-driven workloads; pod security policies; secrets management; and centralized logging and monitoring. Observability needs to be centralized: it's not that each tenant builds their own. If clusters have specific use cases or a backlog, you need to move those into a centralized concept that tenants can easily adopt.

The best practices, again: namespace isolation; dynamic environment provisioning, meaning we should be able to provision ephemeral clusters whenever a customer needs them; RBAC policies again; educating the tenants, because whatever we bring into the clusters and build into the multi-tenancy, whether it's new capabilities, documentation, or APIs, we need to educate our tenants as we bring it in; and efficient resource utilization. Because we have multiple applications sitting across one cluster or multiple clusters, we should be able to use the resources effectively. So these are the ways we can avoid the pitfalls. Now let's deep dive into a couple of topics; my friend Shravan will take it up.

Hey again. Let's look at a few dimensions of multi-tenancy. Some of the dimensions we're going to talk about are standard workload templates and workload types, how we can isolate these applications, one of the critical pieces, lifecycle management of these multi-tenant clusters, and understanding scaling.

Let's get into workload templates. There are a few ways we can address workload templates. For example, we can have operators: we can build custom operators, where we create our own CRDs and the operator creates Kubernetes resources as per our needs. Or we can develop Helm charts. Or else we can use Kustomize, so that you can deploy the same workload across different environments. We will dig deep into operators, including a demo, in the next slides.

And workload types: as we build templates, we can also incorporate workload types. Some workloads can be memory-optimized, and some workloads can be CPU-intensive. As we build the templates, we can incorporate these workload types into them, so that you have standardized templates available for end users to use. For example, we can have PCI- or PII-compliant workload types, which have more networking needs.
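As a rough illustration of what a workload-type template could look like with the Kustomize route, here is a minimal sketch, assuming a hypothetical memory-optimized node pool labeled `workload-type: memory-optimized`; the labels and resource sizes are illustrative, not the actual American templates.

```yaml
# kustomization.yaml -- hypothetical overlay for a "memory-optimized"
# workload type layered over a shared base Deployment.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base            # base Deployment shared by all workload types
patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: web-app
      spec:
        template:
          spec:
            nodeSelector:
              workload-type: memory-optimized   # pin pods to the right node pool
            containers:
              - name: web-app
                resources:
                  requests:
                    memory: 2Gi
                    cpu: 250m
                  limits:
                    memory: 4Gi
                    cpu: 500m
```

The same base then gets a different overlay per workload type (CPU-intensive, PCI-compliant, and so on), so end users pick a type rather than hand-writing specs.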
In a multi-tenant environment, isolation is going to be a critical piece, as most of the previous talks also mentioned. There is network isolation and there is resource isolation. Let's take resource isolation. As we host multiple workloads in the same cluster, isolation at the namespace level can be achieved using resource quotas. This way, a single application cannot take the entire cluster's resources: for example, if a namespace has a 2 GiB memory quota, we can limit it to 2 GiB only, so that all of the memory is not allocated to the same namespace. At the deployment level we can also put limits for CPU and memory, so that no single deployment takes all the resources. And at the network layer, we can set network policies, or if you are using a service mesh, we can use traffic permissions.

The next thing is lifecycle management. In shared infrastructure, lifecycle management is a critical piece. Some of the aspects of lifecycle management would be: automate wherever possible; test in pre-prod environments before you implement in production; have a backup and recovery strategy; make sure you have observability and monitoring; and document as you build these strategies.

I'll just spend a few minutes here. This is an example of a platform we built at American Airlines. In this picture, the developers come to an internal developer portal called Runway, which is based on Backstage. Here they can create a namespace in a specific cluster, and that namespace has quotas and RBAC assigned to it. Once the namespace is created, we have templates and workload types. They choose a template, and once they choose a template, we create a GitHub repo for them, and in that GitHub repo we place custom resources generated from the user's specifications. Once the custom resource is there, Argo CD takes those manifest files and deploys them into the cluster, and on the cluster we have an operator that converts that custom resource into all the underlying Kubernetes resources.

So let's look at a quick demo of how we can do this. Here I am deploying an application into the cluster. You can see this is a custom resource. It has standard templates behind it, so users don't have to really worry about defining all the Kubernetes specs. For example, here they are just specifying the image name, the port they want to expose, the autoscaling policy, and how many pods they want. I'm going to create a namespace where this custom resource is going to be deployed; we call the CRD a "web app". I'm creating the namespace here, and then I'm deploying the custom resource. The operator is watching for that resource on the cluster, and you can see the resources being created in the background: from this one custom resource, it creates all the Kubernetes-specific resources using the standard templates.
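The custom resource in the demo looked roughly like the following. This is a hypothetical reconstruction: the API group, version, and field names are illustrative guesses from what was shown on screen, not the actual CRD schema.

```yaml
# Hypothetical "web app" custom resource: the tenant supplies only the
# image, the port to expose, and the scaling policy; the operator expands
# this into Deployment, Service, HPA, and so on behind the scenes.
apiVersion: platform.example.com/v1alpha1
kind: WebApp
metadata:
  name: demo-app
  namespace: demo-team
spec:
  image: registry.example.com/demo-app:1.0.0   # the image name
  port: 8080                                   # the port to expose
  scaling:                                     # the autoscaling policy
    minReplicas: 2
    maxReplicas: 5
```

The point of this shape is that the tenant writes four or five fields and the operator owns everything they expand into, so the standard templates can evolve without touching hundreds of application repos.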
Then we can also manage the lifecycle through the same CRD. For example, if a user wants to change the resource quotas or deployment quotas, we can expose that through the CRD; or if they want to delete these resources, they can still manage deletion through the operator. We repeat this process for hundreds of applications.

To make this happen and build this platform, we have about 15-plus components deployed on our Kubernetes clusters. For example, some components are related to observability, like Dynatrace and Fluent Bit; there is Velero for our backup and recovery strategy; there are components for cost management, like Apptio Cloudability; secrets management with HashiCorp Vault; and continuous deployment with Argo CD. As I said, we repeat this process for hundreds of applications, and we have a fleet of clusters set up to support this platform, so standardizing these clusters and managing their lifecycle is a critical piece. To simplify the process, we used Argo CD ApplicationSets extensively: a single ApplicationSet deploys one of these components across all of our clusters. We manage all the manifest files for the individual components in a GitHub repo, then use a target revision version for each component and update the ApplicationSet, so that every component is deployed consistently across all our clusters.

Another consideration in multi-tenancy is scaling. When you create a multi-tenant cluster, dynamic scaling is very important. Some of the considerations we usually take into account for scaling are network plugin selection, pod CIDR selection, the IP ranges allocated to your cluster, and automated scaling policies and the monitoring of them. For example, on our AKS clusters we started with the kubenet plugin, the default Azure Kubernetes plugin. We identified that the number of nodes this plugin can support is only 400, so we moved to Azure CNI, where the cluster can scale beyond 400 nodes. Then there are pod CIDR considerations: some network plugins need a pod CIDR, and if we choose a small pod CIDR, that translates to only a few nodes; the same goes for the cluster subnet assigned to the nodes. Choosing the correct CIDR makes sure we have enough nodes available when the scaling needs are there.

Once we enable scaling, we still have to monitor and make sure the scaling is appropriate. On our AKS clusters we enable autoscaling, and a few applications had wrong configurations that left us with extra nodes: for example, one wrong configuration ended up adding about 20 nodes over our fleet of clusters, which cost around $10,000 per month. To avoid this, we have to monitor, make sure the scaling is appropriate, and address those needs.
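The guardrail for that kind of runaway scaling is mostly sane bounds on each workload's autoscaler. Here is a minimal sketch with a standard HorizontalPodAutoscaler; the names and numbers are illustrative, not our platform's defaults.

```yaml
# Hypothetical HPA for one tenant app: the hard maxReplicas ceiling keeps a
# misconfigured workload from forcing the cluster autoscaler to add nodes.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
  namespace: tenant-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 2
  maxReplicas: 10          # agree this ceiling with the tenant up front
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Pairing a ceiling like this with the per-namespace ResourceQuota shown earlier bounds the blast radius in two places: the HPA caps pods, and the quota caps what those pods can request.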
Now, this is a quick poll; if you can scan the code, there are a few questions we'd like you to answer. Let's take this quick poll; we need your help. Let's see how many of you are using... okay, I think everybody answered this question, so let's skip to the next one.

Let's see this one: what is your preferred method of upgrading Kubernetes clusters? We have gone through this lifecycle ourselves: either spinning up a net-new cluster, taking a complete backup of your workloads with Velero and restoring them onto the net-new cluster, building up the new cluster as if we were preparing for DR, and we used to do that for every upgrade. But when your cluster grows, it becomes a tedious job, because the cluster has outgrown that approach. So it's either doing that with net-new clusters, or keeping up with an alternative pattern, maybe with Argo CD, keeping your application deployments declarative like that. We just want to take the pulse of what your Kubernetes upgrades look like: spinning up a net-new cluster, or in-place upgrades? What is working better for you? In-place is winning. In-place is winning. Okay.

So let's go to the next question, Shravan. How frequently do you upgrade? Is it as soon as a new Kubernetes version releases, or at N-1 or N-2, maybe quarterly, half-yearly, or annually? And I see some of us don't upgrade the clusters at all; that is one of the pitfalls and anti-patterns. We need to keep up with at least N-2.

We have some more topics, so I'll take questions during the Q&A. This is the last question: what is your preferred method of testing upgrades before applying them to production? Okay, so it's "other". I don't know what goes into "other", but let's see: "other" is winning. That keeps me curious to understand what other methods you are using. But yeah, thanks, thanks a lot for taking the poll. That was very helpful for us to understand what the trend is.

So let's get to our successful implementations and our success stories at American, right? Where did we start? We started with Kubernetes like any other tenant, five or six years ago. We started with on-prem clusters, later moved to the cloud as a cloud-first strategy, and now we host the clusters completely on AKS. What we have seen with the trend toward multi-tenancy is that we have grown our multi-tenant applications at a very rapid pace over the last one and a half years; it's almost less than one and a half years since we set up the multi-tenancy. Before that, there were a lot of individual, dedicated clusters, even for small applications, wasting resources, because when you spin up a cluster you need to spin up node pools for the system components plus a node pool for your app workloads. So we avoided spinning up dedicated clusters like mushrooms for very repeatable workloads by using the standardized templates we can build up through the front end. This is last week's data: across four clusters, with East and West as the DR regions and with prod and non-prod, we are now hosting around 659 namespaces and 1,736 deployments.
So This is a huge achievement for us where where we started from one and one half year So that because the cloud migration which we are pulling it through this this has been possible and there's other Story as we said we should not ignore what the customer needs are right so the PCI So this is another there's a payment card industry There are some adhered rules that we need to because AKS by default is a public and we need to build a private connection clusters With that when we open it up for a multi-tenancy we got more Customers onto it. One is a payment services and reference application. I don't know how many of you flew American Airlines to come in I don't know what let's see how many of you flew in American Airlines No one. Yeah Okay, so but so the refunds application whenever you have an issue So those kind of when you call in those refunds and those kind of applications are still They're running on Kubernetes So what we did for the teams that are running those applications are the time to market because when the teams wanted to build their Own clusters with their private private networking and the security and other things Observability that we need to build across it takes about three to four months to completely work with all the teams But with our multi-tenancy clusters, they were go. They were about to go live in one to two weeks so that was a huge achievement and For the application teams that we are building through with different kinds of workloads that we are as trouble mentioned previously Different kind of workloads, but different kinds of clusters it reduced complexity it for the compliance management It was easy for the audits and it has enhanced observability, which they did not work on So there are some three conclusion conclusion and key takeaways for the day. So always embrace the best practices While navigating the multi-tenancy means that we saw in the first couple of slides, right? So there's a lot of Lot of things that we need to take care So we need to always embrace the best practices for monitoring, securing and managing the resources. There's continuous optimization because optimization is a process When the tenants come through it the bin packing or when you're giving the namespaces or the quarters and everything So there's always a optimization that is needed on the clusters and we have multiple sessions coming up how we did with the granulate and how we reset our platforms for an optimization across watch out for those sessions and stay vigilant against the pitfalls So what are the pitfalls that we have mentioned it stay vigilant against it for all the developers administrators and architects There's always a potential challenges. We need to stay ahead of the curve That's it. So now we are accepting all the queries confusions and Kubernetes confessions There's small help. I think there's only one guy. So if you can raise your hand so that they can he can come to you Yeah, only one person so that one or two questions maybe Wow The responsibility um, I was curious about your release pipeline is you kind of Showed that you're using gate ops. You have um Basically staggering by regions. How many how many clusters are you targeting for any step? 
Because avoiding a big-bang rollout is part of this release process, right? So, all of our clusters are the same; we deploy the same version across all clusters. For example, if you are upgrading a component like the Grafana Cloud agent and there is a new release, we update the manifest files with the new image or any configuration changes, then create a target release from those same manifest files and use that target revision version across all the Argo CD applications. So all of our clusters are on the same versions. We are at about 30 clusters. Thirty clusters, around that. Thank you. Thank you all. Thanks.
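To illustrate the fleet-wide rollout pattern described in that last answer, here is a minimal Argo CD ApplicationSet sketch using the built-in cluster generator. The repo URL, component, and revision are hypothetical, not the actual American Airlines manifests.

```yaml
# Hypothetical ApplicationSet: one definition stamps the same component, at
# the same target revision, onto every cluster registered with Argo CD.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fluent-bit
  namespace: argocd
spec:
  generators:
    - clusters: {}               # one Application per registered cluster
  template:
    metadata:
      name: 'fluent-bit-{{name}}'
    spec:
      project: platform
      source:
        repoURL: https://github.com/example-org/platform-components.git
        targetRevision: v1.4.2   # bump this one tag to roll the whole fleet
        path: fluent-bit
      destination:
        server: '{{server}}'
        namespace: logging
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Bumping `targetRevision` in one place rolls the component to every registered cluster, which matches the "all clusters on the same version" answer above; staggering would mean splitting clusters into groups with separate ApplicationSets or label selectors.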