Welcome to the CNCF End User Lounge, where we explore how cloud-native technologies are adopted by end user organizations across different industries and sectors. The CNCF End User Community is made up of more than 155 vendor-neutral companies that use open-source software to deliver their products. I'm David Zolotusky, a principal engineer at Spotify. Today with me, I have Rajan and Dave from Fidelity Investments as our guests. In these live streams, we bring on end user members to showcase how their organizations navigate the cloud-native ecosystem to build and deliver their services and products. Join us every fourth Thursday at 9 a.m. Pacific Time. This is an official live stream of the CNCF, and as such it is subject to the CNCF Code of Conduct. Please don't add anything to the chat or questions that would be in violation of the Code of Conduct. Basically, please be respectful of all of your fellow participants and presenters. If you have any questions for us, we'll be monitoring them throughout the stream, so make sure to ask your questions in the live chat. This week, we have Rajan and Dave here to talk to us about Fidelity Investments. Before we get into the conversation, Rajan, Dave, could you briefly introduce yourselves, please? Yeah, sure. I'll go first. My name is Rajarajan Pudupatti; people call me Rajan. I'm part of an organization called ECC within Fidelity, and within that, I'm part of the cloud platform team. I primarily work on the Kubernetes-based projects, and the goal of our team is basically to set up the next-generation application platform for our users. To put it in the right words: if, let's say, a developer within Fidelity wants to do something to achieve a particular business objective, we want to make it as simple and as easy as possible for that developer. So that's me. Very happy to represent the cloud platform team. Hi, my name is Dave Batello. I'm in charge of the private platform squad within Fidelity, where we're primarily focused on building out Kubernetes platforms to run our container workloads on-premise. We have probably around 40 production Kubernetes clusters running on-premise at one time, supporting a variety of workloads with close to about 10,000 containers. Cool, that sounds like a lot. Yeah, I'm sure it feels like a lot sometimes. Before diving deeper into the Kubernetes part, can you tell us a little more about the infrastructure setup at Fidelity, and what prompted you to start adopting cloud-native tools and Kubernetes? Yeah, I can start with that. So basically, we have a mix of on-prem and the cloud, and we're on multiple cloud providers as well. Just to give an idea: instead of just setting up clusters and then opening them up for the users, our goal was to come up with a platform that is more Fidelity-specific, in the sense that we want all the best features from the CNCF technologies to be available for the users, but at the same time we have some hard constraints from an enterprise standpoint. For example, today, if I'm a developer within Fidelity, it's not very easy to just go and spin up your own Kubernetes cluster, deploy an application, and take it to production. There are a lot of security aspects that come into the picture: in terms of the images you're using, that's very important; in terms of the AMIs; there's a long list.
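[Editor's note: to make the "approved images" constraint concrete, here is a minimal sketch of the kind of check an enterprise platform might run, for example in an admission webhook. The registry list and function names are hypothetical, not Fidelity's actual policy.]

```go
package main

import (
	"fmt"
	"strings"
)

// approvedRegistries is a hypothetical allowlist; an enterprise
// platform would typically source this from a managed policy store.
var approvedRegistries = []string{
	"registry.internal.example.com/",
	"quay.io/approved-org/",
}

// imageAllowed reports whether a container image reference comes
// from an approved registry prefix.
func imageAllowed(image string) bool {
	for _, prefix := range approvedRegistries {
		if strings.HasPrefix(image, prefix) {
			return true
		}
	}
	return false
}

func main() {
	for _, img := range []string{
		"registry.internal.example.com/payments/api:1.4.2",
		"docker.io/library/nginx:latest", // not on the allowlist
	} {
		fmt.Printf("%-55s allowed=%v\n", img, imageAllowed(img))
	}
}
```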
And this particular list keeps changing as well. For example, there could be a security event, not just within Fidelity but anywhere outside, that triggers a new policy or a change to an existing policy. These constraints are something that, as a developer, is not so easy to keep up with. So the goal we set out with was to build an application platform where we take all these constraints into account, so that as a developer you get to experience all the best features from the CNCF technologies, but at the same time you're guaranteed that you are running in a secure Fidelity environment. Security is one aspect of it; there are other aspects like compliance and a lot of other things, which we'll get into in detail. But that was the goal: as a developer, we want to make it really, really easy so that you're able to quickly deliver the business objectives. So Dave mentioned the on-prem clusters; that's the on-prem side. Put together with the EKS and AKS clusters on AWS and Azure, the total crosses 300-plus now. We have a dashboard, and sometimes we look at it and it moves from one number to another, so I'm pretty sure we've crossed 310 or something like that. So that's the high-level setup. Dave, if you want to add. Yeah, from an on-prem infrastructure perspective, we primarily built out our platform around vCenter, around vSphere infrastructure. For our on-prem services, there's also a proprietary API platform, which we built on-prem, that sits in front of our vSphere infrastructure. We leverage that for a lot of the provisioning of the virtual server instances that are the foundation we build Kubernetes on. So a large portion of the responsibility for building out the platform entails also understanding how the infrastructure works behind the scenes and tightly coupling our integration deployments, our Kubernetes build-outs, to that infrastructure. Yeah, and I just wanted to add, from a CNCF technology standpoint: we use Kubernetes of course, but we also use Helm. Helm is our standard for packaging. Whenever I mention something like Helm, keep in mind we take the approach that we don't want to lock users into the full stack. We follow the famous "batteries included, but swappable" idea, so as a user you have the option to switch to something else as well, but Helm is the most widely used packaging tool. In essence, this platform, which we'll talk about more in detail, is kind of a base platform on top of which you can run a lot of things. In that perspective, we have Envoy running on top of the EKS clusters as part of an API gateway and things like that. So it's a combination of all these technologies. That sounds really good. Are there other CNCF projects, aside from Kubernetes, Helm, and Envoy, that you're excited about, either using now or planning to use in the near future? Yeah, we are constantly exploring. Almost always we try our level best to not create something in-house. We try to look at where the community is going, and we want to stick with the community, right? So almost always we look at the landscape and make sure that we pick something from it.
So, for example, we use some, if not all, components from Flux CD, for example the Helm Operator, which was part of Flux and is the Helm Controller now, and now it's part of the GitOps Toolkit, right? So we use that extensively. We actually built an open source project called Kraan on top of it. So basically we took these technologies from the CNCF and built something on top to extend them a little bit for our use case, and we have open-sourced that as well. So the GitOps Toolkit is one example, and Kraan is the open source project we built on top of it. And we're always constantly exploring projects. On the telemetry side, we are looking at Fluent Bit and things like that; we use Fluent Bit for a lot of our log collection. We're also pretty heavily invested in OPA from a governance and compliance perspective, so we're using OPA to build policies and constraints around how we govern the platform itself. There are all different types of policies that we've implemented, for example to enforce specific metadata that has to be associated with namespaces. And we're looking to make the migration very shortly from the native Pod Security Policies within Kubernetes over to OPA for that policy enforcement. So that's just a couple more examples of the CNCF projects we're using. Yeah, that makes a lot of sense, and I think both of you alluded to or touched on security for the environment, and I don't think you necessarily said it, but I assume there's a lot of regulation as well. So I'm curious how being in such a regulated space, required to be highly secure, impacts the way you look at all of this infrastructure and open source tooling. Yeah, so we have very strict regulations; being part of the financial industry, that's really important. The way we look at it is that it's very important to have that built into the platform, right? We have built the platform in such a way that, from a user standpoint, today if you talk to any of the Fidelity developers, they don't look at it, if, let's say, they're in AWS, as an EKS cluster or a Kubernetes cluster; they look at it as a Fidelity platform. Internally we call it FID EKS, FID AKS, and so on. It's always referred to as, hey, FID EKS version one. We have our own Fidelity platform versioning, so they usually say FID EKS 1.0, 2.0, and so on. So coming to the security point, what we've done is package all these things as part of the platform. Whenever we make a release, there are all these add-ons; I'll give you the example of the OPA one Dave was mentioning. That's something that's part of the Fidelity platform. From the Fidelity user's standpoint, they don't look at it as one standalone add-on running in a cluster; they're exposed to the features that come out of the add-ons. The way we try to portray it is: we have this platform, and you have all these features available to you. Don't focus on the add-on aspect of it, because behind the scenes, between FID EKS 1.0 and 2.0, we may actually switch an add-on to something else, or we may combine two add-ons.
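[Editor's note: Dave mentioned OPA policies that enforce namespace metadata. Below is a minimal, hypothetical sketch of evaluating such a rule with OPA's Go SDK (github.com/open-policy-agent/opa/rego); the policy content and the required labels are illustrative, not Fidelity's actual rules.]

```go
package main

import (
	"context"
	"fmt"

	"github.com/open-policy-agent/opa/rego"
)

// A hypothetical rule: namespaces must carry owner and cost-center labels.
const policy = `
package platform

import rego.v1

deny contains msg if {
	not input.metadata.labels["owner"]
	msg := "namespace is missing the owner label"
}

deny contains msg if {
	not input.metadata.labels["cost-center"]
	msg := "namespace is missing the cost-center label"
}
`

func main() {
	ctx := context.Background()

	// A namespace object roughly as an admission webhook might see it.
	ns := map[string]interface{}{
		"metadata": map[string]interface{}{
			"name":   "payments-dev",
			"labels": map[string]interface{}{"owner": "payments-team"},
		},
	}

	r := rego.New(
		rego.Query("data.platform.deny"),
		rego.Module("policy.rego", policy),
		rego.Input(ns),
	)
	rs, err := r.Eval(ctx)
	if err != nil {
		panic(err)
	}
	// Each result value is the set of violation messages.
	for _, result := range rs {
		for _, expr := range result.Expressions {
			fmt.Println("violations:", expr.Value)
		}
	}
}
```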
We'll do a lot of things like that as part of the platform, but from a user standpoint, they just look at it as a feature. From this perspective, we've built the security features into the platform, so that if you're going from one Kubernetes version to another, we have this rigorous process where we check every single add-on that is part of the platform. It goes through a process to make sure that between the Kubernetes versions and between the add-on versions, nothing has changed that impacts our security guidance. It could be as simple as a particular add-on version whose base image is not compliant with some of the current security policies; that's one good example. Those are the things we validate as part of our rigorous validation process before we release a platform version. So when the users get a notification that, hey, we have this FID EKS 2.0 or FID AKS 2.0, most of these things are already handled for them, and we make it part of our internal release notes so they're aware of what has changed. Sometimes we have to take a slightly different approach: for example, say we want a particular change because something is not in compliance with one of our security policies, but we're unable to get that change from the open source project immediately. Those are the cases where we come up with a workaround. For a small period of time, we might release the feature as part of the platform version but apply a workaround for a couple of months or so, until the actual change lands in the open source project. These are things we do as a platform team, but from a user standpoint, they're unaware of them; from their standpoint, there's a feature that is very stable and working. Yeah, and so what does versioning and upgrading look like from a user standpoint? Like, you release FID EKS N-plus-one. So, maybe a little bit on the Fidelity structure first: we're not one central team that manages all the clusters. Fidelity is a large organization and we have a lot of sub-organizations; each business unit is like a company by itself, with a lot of developers and their own dedicated DevOps teams, SRE teams, and so on. So the way it works is that a business unit will have a DevOps team and an operations team, and when we release, it's actually them taking the platform version and upgrading the clusters. We give them the tools; we have a UI where they can go and do it, but we don't do it for them. They have their own timelines and it's up to them. So basically it goes like this: we release a FID EKS version, we have a release call, and the users get to see what new features are coming in and what the breaking changes are, and then they decide when they can actually do it. We do have a timeframe: we support something like N-minus-three or N-minus-four versions, so it's not as if a particular business unit can stick with one version forever.
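[Editor's note: a tiny, hypothetical sketch of the "one platform version pins many add-on versions" idea described above. The type names and version numbers are invented for illustration.]

```go
package main

import "fmt"

// PlatformRelease pins every add-on chart to an exact version, so
// business units reason about one platform version rather than
// twenty separate add-on versions. All values here are invented.
type PlatformRelease struct {
	Name       string            // e.g. "FID-EKS 2.0"
	Kubernetes string            // the Kubernetes minor it targets
	Addons     map[string]string // chart name -> pinned chart version
}

var fidEKS20 = PlatformRelease{
	Name:       "FID-EKS 2.0",
	Kubernetes: "1.20",
	Addons: map[string]string{
		"opa-gatekeeper":     "3.4.1",
		"cluster-autoscaler": "9.9.2",
		"fluent-bit":         "0.16.1",
	},
}

func main() {
	fmt.Printf("%s (Kubernetes %s)\n", fidEKS20.Name, fidEKS20.Kubernetes)
	for chart, version := range fidEKS20.Addons {
		fmt.Printf("  %-20s %s\n", chart, version)
	}
}
```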
But at the same time, they pick the version and they upgrade the clusters, and if there are any issues, that becomes a platform issue. It's an issue happening with platform 2.0, and then we jump in. We have all these monitoring dashboards and everything set in place, so we would know right away if somebody is doing a cluster upgrade, a platform upgrade, and something is going wrong; we would automatically know. So that's how it works. From the perspective of the DevOps team member who's picking up the version: we might have packaged 20 different add-ons within the platform, each with its own version, but they're not worried much about that. From their standpoint, they look at the entirety of it as one single version. Even if one of the add-ons is not working, let's say a particular feature is not working, from their perspective it's simply the platform version that's unstable, and we just release a patch version for it. That's how it goes, actually. So on the private cloud side, on-prem: what Rajan was talking about was a lot of how we manage things out in the public cloud, and a large portion of that is self-service, right? They cut the release, they provide the release out to the customer base, the customers consume it, and they pretty much have the ability to roll it out on their own schedule. On-prem, we're a little bit more prescriptive; we take a little bit more control over it, and it's probably more of a managed offering than anything. And what we've been trying to do over the last year is, obviously, try to keep up with the Kubernetes versions. That's a challenge, right? This past year, we had a target to do four upgrades in one year, and we did four upgrades this year. By the end of this year, we'll have done four Kubernetes upgrades, with a target of one per quarter. So hopefully, and maybe my team's listening, we'll get 1.20 out the door by the end of this year. And that's a substantial amount of work just to keep up with the versions. That doesn't include all the work we do around add-ons, the things Rajan was mentioning around maintaining the versions for all the different charts that we roll out to support the environment and provide other capabilities. Yeah, I see a question from Maybuk: how do you keep up with the upgrading? Dave, if you're okay, I just want to touch on that, because I think some of our learnings can be useful for the viewers. So there's this problem, especially when you are multi-cloud. Imagine we're trying to have a platform that provides certain features, and from a user standpoint they're looking at this unified platform that is supposed to behave the same wherever you are, whether you're on-prem or running natively on Azure. That's a very difficult thing to do. Especially when, for example, on the on-prem side we're using Rancher; the versions that Rancher supports will be slightly different. I'm referring to the N-minus-three, N-minus-four problem, where one vendor might say, hey, my current version is 1.20 and I follow the N-minus-three model, so at any point in time 1.17 is the oldest supported version. At the same time, another vendor in the cloud, AKS for example, could be at 1.19 with N-minus-four, so their oldest supported version is 1.16, right?
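[Editor's note: a small sketch of the version-skew arithmetic Rajan is describing, computing the window of Kubernetes minors that every provider still supports. The provider numbers are the hypothetical ones from his example.]

```go
package main

import "fmt"

// provider models a Kubernetes distribution with a newest supported
// minor version and a support depth (N-minus-3, N-minus-4, ...).
type provider struct {
	name   string
	newest int // e.g. 20 means 1.20
	depth  int // e.g. 3 means N-minus-3
}

func main() {
	providers := []provider{
		{"on-prem (Rancher)", 20, 3}, // supports 1.17..1.20
		{"AKS", 19, 4},               // supports 1.15..1.19
	}

	// Intersect the windows: the common range every provider supports.
	lo, hi := 0, 1<<30
	for _, p := range providers {
		if p.newest-p.depth > lo {
			lo = p.newest - p.depth
		}
		if p.newest < hi {
			hi = p.newest
		}
	}

	if lo > hi {
		fmt.Println("no common supported version; someone has to wait")
		return
	}
	fmt.Printf("common supported window: 1.%d through 1.%d\n", lo, hi)
}
```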
So how do you handle that? It's a very tricky problem, and there's no clean solution to it, let me put it that way. That's where the platform leads constantly meet, and sometimes we ask Dave, for example, to slow down a little so we can catch up, and things like that. But one thing that we always put first is the stability of the platform. Even if, let's say, 1.20, I'm just taking a number, has some really important feature and some of the teams are waiting for it, if we think we won't be able to provide this uniform experience, say Azure moves forward while AWS lags behind, we really evaluate it and try not to do that. We try to wait. Stability becomes more important than releasing new features. So sometimes we'll tell the application teams: hey, this is a feature you want, but can you live without it for another few months? Is it absolutely essential? The point here is that stability always comes first. That's one thing. Another thing is that it took some time for our internal users; I think it's a mindset. These are big clusters, and a lot of critical applications are running on them, right? A year back, when we went to them and said, hey, the community moves very fast, if you look at Kubernetes as a project the developers are amazing, they come up with all these features very quickly, so the versions move very fast, and we release versions fast as well, it was very difficult for our internal customers to digest the fact that every two months or so they'd have to do a major upgrade. Now, looking at how stable the whole thing has been, we've taken it to a point, and maybe this is more my personal take, where it's okay to do the upgrade; it will be stable. Building that sort of trust is very important. If you can put all your effort into making the upgrades really, really stable, then that confidence builds in the users, right? For example, look at our upgrade validation process: every single add-on that we use, and we have a mix, a lot of community add-ons plus some custom-built ones, we write a lot of operators, every single add-on gets a rigorous walkthrough. We have a separate set of smoke tests and integration tests that is very well maintained, and it almost always catches an issue that maps to a particular version. There's a rigorous amount of work that goes into validating each of the add-ons that are part of the platform. We put a lot of effort into the stability aspect, which in turn increases the confidence of the users, and now it's the new normal. It's not like some time back, when upgrades happened only a few times a year for major platforms; that's not the case right now. So building that confidence in your users is very important. I just wanted to add that.
Yeah, that makes sense. And I have a question that builds on that exact idea of stability and confidence from the user side. They're asking how you make sure that upgrades or updates to any of these components are safe to apply, and on top of that, how do you limit the blast radius if you find that some are not safe to apply? That's a good point. One of the things we do, and I know it differs from company to company, is we have certain engineering clusters; we call them test clusters, platform engineering clusters. I'll give you an example. Between a development, a testing, and a production environment, there are usually differences in terms of policies and things like that, so we make sure our platform engineering test clusters cover all of those spaces. When we start out, everything starts locally, right? We have a very strong set of test cases that are very well maintained; it's based on a combination of Cucumber and a few other things. So we have a very strong set of integration tests and smoke tests that is very well maintained. I keep stressing "very well maintained" because it's easy to come up with the first set, but over a period of time you can easily stop maintaining it well, and then it loses its purpose. We rely on that, and it catches a lot of issues. Even after that, there's rigorous testing on a per-environment basis: we test in platform engineering dev, platform engineering production. These are big efforts, but at our scale, supporting 300 clusters, we cannot afford to make mistakes. We do make mistakes here and there, but we do everything possible, with these sorts of practices, to avoid them. So that is one thing: a strong, well-maintained integration test suite, tested through every environment type. And beyond that, when we release, we don't upgrade all 300 clusters; as I said before, it's the users picking up the release. We also work very closely with some of the business units who are usually okay picking something up first, to see whether there are any issues when they upgrade their development clusters; they usually start with development clusters. We monitor that very closely. We have a lot of telemetry that helps us there: if somebody is picking up a release and putting it on their dev clusters, say the first of the 300 clusters getting upgraded, all our eyes are on it. We watch it very carefully, and if we see an issue, we react quickly. Sometimes, it's rare, but we'll even pull the whole release and say, you know what, we'll come out with a patch fix. So there's no straightforward answer, but one takeaway, if you want one: maintain a strong integration test suite. Yeah, and I can add on to that a little bit. From the on-prem perspective, like Rajan said, we've definitely spent a lot of time building out these test suites: unit testing, functional testing.
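[Editor's note: a minimal sketch of the kind of cluster smoke test being described, using the standard client-go library to verify that every node is Ready. The kubeconfig handling and the choice of check are illustrative; Fidelity's actual suites are Cucumber-based and far more extensive.]

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig the same way kubectl does (illustrative).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Fail the smoke test if any node reports NotReady.
	for _, n := range nodes.Items {
		for _, cond := range n.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
				log.Fatalf("smoke test failed: node %s is not Ready", n.Name)
			}
		}
	}
	fmt.Printf("smoke test passed: %d nodes Ready\n", len(nodes.Items))
}
```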
And ensuring that we're not just doing this testing at the end of a release cycle, but doing these types of tests all the time. Some of the strategy behind it really is around building that end-to-end testing: something we can run on a daily basis, something that brings issues to our attention daily, versus finding out right before the end of the release. I think the second piece of that, for us, is rolling out these releases in a bit of a canary fashion, if you want to call it that. In our area, we have multiple zones and multiple regions, and we'll do one at a time from a non-production perspective. We give our business partners adequate time to cycle through that environment, make sure they've deployed workloads multiple times and become comfortable with it, and then the subsequent upgrades to our production clusters are scheduled during tech windows, during times when there's the least chance of impact to our production workloads. So that's a lot of the method behind the madness, that's for sure. Yeah, there's another question about chaos engineering. I want to take that; it's a very good question. We've been doing the chaos engineering stuff since early 2021. But the point I want to stress is that even two years back, I remember very clearly, even in 2019, we made sure that, for example, when we wanted to add a feature to the platform and the feature came from a particular community-maintained add-on, there were cases where the community-maintained add-on might not have a helm test associated with it. Even when we bring that in, we make sure that before you can plug it into the platform, you have to add your test case to it. We run helm test against all the add-ons; no add-on can go into the platform without a test case associated with it. We also take it a step further. We have this open source project called Kraan, where we came up with the idea of layers. Basically, you have a collection of add-ons, right? Look at our case: we have clusters running in Azure, AWS, and on-prem. There are certain add-ons that run everywhere, certain add-ons that run only in the cloud, meaning AWS and Azure, and certain add-ons that run only on-prem, for example. So we packaged all these add-ons in terms of layers: we have a security layer that is shared across these platforms, and we have a cloud layer that applies only in the cloud, on EKS and AKS. The reason I bring up this layer concept is that even two years back, we were very clear that each add-on should have a test associated with it, and that a layer, which is a collection of closely related add-ons, would have an integration test associated with it, which is basically another Helm chart. So imagine a layer with five add-ons, each one a Helm chart: each Helm chart has a test, and there's a last add-on in the layer, also a Helm chart, that does the integration testing of all those add-ons together. These were significant efforts, but they paid off really well in the long term. So helm test is extremely important; even if you pick up a community project that doesn't have it, please add it yourself.
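[Editor's note: `helm test` runs a chart's test hooks against a live release. Below is a minimal, hypothetical harness in the spirit of what Rajan describes: it shells out to the Helm CLI for every add-on release in a layer and stops on the first failure. The release and namespace names are invented.]

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// A hypothetical "security layer": each entry is a deployed Helm
// release that must carry its own test hooks.
var securityLayer = []struct {
	release   string
	namespace string
}{
	{"opa-gatekeeper", "gatekeeper-system"},
	{"cert-manager", "cert-manager"},
	{"security-layer-integration", "platform-system"}, // the final integration-test chart
}

func main() {
	for _, addon := range securityLayer {
		fmt.Printf("running helm test for %s...\n", addon.release)
		cmd := exec.Command("helm", "test", addon.release, "--namespace", addon.namespace)
		out, err := cmd.CombinedOutput()
		if err != nil {
			log.Fatalf("helm test failed for %s:\n%s", addon.release, out)
		}
	}
	fmt.Println("all add-on tests passed")
}
```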
At the same time, come up with an integration-test Helm chart that can validate how certain add-ons work together. I'll give you an example. As part of our onboarding process, we created an extension to namespaces called namespace groups. Users typically aren't exposed to namespaces directly; they always start with something called a namespace group. As soon as you create a namespace group, certain things happen: your AD groups are automatically created, and so on, and it's work done by a few add-ons together. So there's an integration-test Helm chart that checks that particular flow. These are some of the things we've been doing from the beginning. More recently, starting in early 2021, we have been focusing a lot on the chaos engineering stuff; that's part of our suite as well. Yeah, and on the chaos engineering aspect of it: we're dabbling in that right now. I have definitely looked at integrating Chaos Mesh into some of our pipelines, which would not only handle building these clusters and running through unit tests, but also knocking things over and then ensuring that the platform continues to function as we expect it to. So from my perspective, we're still at the beginning phase of that. That makes a lot of sense. And beyond the specific tests, there was a question about chaos engineering, but also about post-mortems: do you have ways of ensuring that when things do go down, or you do run into issues, it doesn't happen again? Like, how do you bring that back into your testing frameworks? Yes, that actually happens. So typically, when we prioritize our work, and it might sound obvious, but I want to stress it because it really works well, we prioritize stability over features. That might sound obvious, but it's very, very important. If we release a particular version and, let's say, at the same time we're working on the next version with a lot of new features, and we find certain things we didn't get right, that feeds back, and we focus on that first, before the new features. The reason I say it sounds obvious is that it takes effort; when you bring that back into your sprint meetings and so on, it has to be given very high importance. I think it's part of our team culture now that we focus on stability first. If you have a small team with a few clusters, it's a different thing, but when you're holding 300-plus clusters for thousands of developers in a big organization like Fidelity, stability comes first. So we immediately take that learning and put it back into our sprint to make sure the changes are made and the test cases are enhanced. Yeah, I'll add on to that, Rajan. I think it's pretty much ingrained into our Fidelity DNA that root cause analysis is the de facto method for coming to conclusions about what needs to be fixed, right?
So my team is very well versed in the fact that when we find problems, we need to get to the root cause to understand how we can resolve them, so it doesn't happen again and our application partners don't run into these types of problems down the road. So yeah, we spend a lot of time tracking, ensuring that we're opening up stories when we haven't figured things out, so that we get back to those and drill into them. As a matter of fact, we were talking about some of those things this morning before we were lucky enough to join your broadcast here. Yeah, and maybe this is an important point: we look at things in a slightly different way. For example, let's say something is happening on the customer side. We have three different environments that are supposed to mimic the customer environment in terms of security profiles and everything else. So if there's something they're catching that we're not catching, we try to look at why that difference exists, which means there's a mismatch in how their environment is configured versus ours. We even try to look at the fundamental process that was broken for that to occur, and we go and fix that, so that not only does this problem not happen again, but many problems of that type won't occur. We go down to that level; even if it's a very basic Fidelity-specific process, we push to change or automate it. We basically try to analyze the root, not just at a high level why this one problem happened, but to the extent of how we prevent not just this problem but similar types of problems from occurring. One example, it happened maybe a couple of years back, was the way our IAM roles were managed in AWS. We took a big step and came up with our own framework based on CloudFormation StackSets and things like that. We changed the whole process. It was a lot of effort, relatively, but now when I look back, it's one of the most important things we did. We had to get a lot of approvals, because it was an already-hardened process, but we got them and we changed it. And since then, not only have we not seen that issue, that whole space has been very stable. You have to go to that level, if that makes sense. Yeah, that does make sense. I guess one more potentially quick thing on testing before we move along. You talked a lot about testing, and it sounded to me like you were mostly talking about testing the Fidelity platform, the pieces that you've built. I'm curious whether your automation and your tests also catch potential issues in the tooling. Like, if there's a change in Kubernetes that breaks something for you, does that get caught here, or is there a different process for catching things like that? No, it's bigger than that, actually. The integration test suites I talked about include test cases for the community components; it's not just our stuff. Sometimes we actually go and raise issues upstream, and it benefits the community as well. So it covers all the Kubernetes stuff as well as the community add-ons. Cool, that makes a lot of sense.
So then I just wanted to take a bit of a step back and hear a little more about the overall architecture. You've mentioned multiple clouds and you've mentioned 300 clusters, but that's about as far as we know, so I'm curious to dig a little deeper. How big are these clusters? What sort of things are you running in them? Are they multi-tenant? All the typical things you'd ask about how clusters are organized. Absolutely, absolutely. So I'll start, and then Dave can jump in. Most of our clusters are in the medium-sized range; medium-sized in the sense that I don't think any has more than 75 nodes or so. It's not thousands of nodes, that's for sure; at the same time, they're not small either. Most of them fall in that range. Everything is multi-tenant; that's one of the key decisions we made back when we started in 2018. There are a lot of processes built around supporting the multi-tenancy. One example is the stuff I was mentioning earlier around the extension of namespaces: when a team onboards, instead of just mapping them to a particular namespace, we created an entity around it, so every team that onboards gets something called an NS group. That's the Kubernetes aspect, but there's a Fidelity-specific aspect in terms of how it gets integrated: for example, how do the AD groups get created? You cannot automate half of it and not the other half. These are things that we did. So everything is multi-tenant, and the number of teams varies, but you could easily find 50 to 75 teams working in a cluster, and I'm talking about teams of, say, eight or so. And they cannot step on each other; there are frameworks built around resource quotas and limit ranges and things like that. For example, the NS group concept includes a section for the resource quota: as a cluster admin, I can say this team gets this NS group with certain resource quotas. Which means the team themselves can go and add and delete namespaces within that NS group, but they're bound to a particular set of constraints. So it is multi-tenant, and most clusters follow the typical approach of having a cluster admin, with as little dependency on the cluster admin as possible. What I mean by that is that you go to them for the initial setup, where they onboard you and set the constraints; after that, we try to avoid a process where you have to keep going back to the cluster admin again and again, so teams are mostly self-sufficient. We do have pipelines set up, but the pipeline part is open, so they can bring their own pipelines, and most of them use their own deployment pipelines and things like that. So that's the cloud side. Dave, if you want to add on. Yeah, from the on-prem side of things, again, we've adopted, or prescribed to, the notion of smaller clusters. Rajan mentioned medium-sized clusters. Initially the thought behind it was maybe we'd go with larger clusters, but ultimately it comes down to blast radius. We've learned that the automation involved with maintenance around these, the rehydration, gets lengthy, right? It takes a lot of time to rehydrate a thousand nodes.
So by breaking it down into smaller pieces, we're really able to create decision points where we can decide whether we need to move on with other clusters if, say, we ran into a problem, or whether we can just continue, or stop dead in our tracks and revert, so to speak. So: smaller clusters, multi-tenant, mostly business-aligned. A lot of our clusters are specifically business-unit-aligned, so that creates separation between our business partners. And within those clusters, they're definitely multi-tenanted, with all the different development groups working within that cluster, separated through many namespaces. In non-production, we see a lot of those namespaces delineated by the various development cycles. And for the most part, those non-production clusters tend to be, on average, at least two times larger than our production clusters, just based on the number of workloads and the various environments applications cycle through before they get to production. From an architectural perspective on-prem, I mentioned earlier that we're predominantly on vSphere, and on top of that, we front all of our clusters with Avi load balancing services. That Avi load balancing service sits pretty much as a Layer 4 proxy down to either the NGINX Ingress Controllers, which handle the path-based routing, or, since we also allow NodePort ranges, direct pod traffic. So, yeah. Cool, so that spurred a few questions. The big thing here is around monitoring and observability of your clusters: what tools do you use? How do you collect metrics, logs, traces, all the normal observability things? And how do you monitor the health of the workers? Yeah, we'll talk about that; I think there's also one question on the ownership of the cluster. So basically, as you said, the business units are our internal clients. We provide this platform, which is basically a framework with a collection of tools to manage it, but the ownership of the cluster is actually with the business units. The platform has defined its own set of roles; there's something called the global admin, the platform admin, and so on. The business unit will have DevOps and SRE teams, and they'll actually be the cluster admins. We do have overall access, but they are the actual cluster admins. So for upgrades, or, for example, if you're having issues with resource quotas and things like that, that's who the users actually go to. That's the ownership of the cluster, and the ownership is divided that way. Whenever I say "platform," at the end of the day, from a Kubernetes standpoint, it's a collection of namespaces. If you open up a Fidelity cluster, you have a set of management namespaces and system namespaces. Anything with "-system" is a system namespace, similar to kube-system, where the critical add-ons run, and then we have a collection of management namespaces where everything from the cluster autoscaler to the ingress controllers runs. That set of things put together is the platform; any issue that happens there, we are responsible for. The cluster admin doesn't even have to look into it; it comes straight to us, because it's a platform issue, the platform is unstable.
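[Editor's note: a minimal sketch of the namespace-group quota idea Rajan described earlier, using client-go to stamp a ResourceQuota into each namespace belonging to a group. The NSGroup type, the label, and the quota values are hypothetical; Fidelity's actual implementation is an internal abstraction.]

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// NSGroup is a hypothetical stand-in for the namespace-group concept:
// a team-owned set of namespaces sharing one quota policy.
type NSGroup struct {
	Team       string
	Namespaces []string
	CPU        string // total CPU requests allowed per namespace
	Memory     string // total memory requests allowed per namespace
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	group := NSGroup{
		Team:       "payments",
		Namespaces: []string{"payments-dev", "payments-test"},
		CPU:        "20",
		Memory:     "64Gi",
	}

	for _, ns := range group.Namespaces {
		quota := &corev1.ResourceQuota{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "nsgroup-quota",
				Namespace: ns,
				Labels:    map[string]string{"nsgroup": group.Team}, // hypothetical label
			},
			Spec: corev1.ResourceQuotaSpec{
				Hard: corev1.ResourceList{
					corev1.ResourceRequestsCPU:    resource.MustParse(group.CPU),
					corev1.ResourceRequestsMemory: resource.MustParse(group.Memory),
				},
			},
		}
		if _, err := client.CoreV1().ResourceQuotas(ns).Create(context.Background(), quota, metav1.CreateOptions{}); err != nil {
			log.Fatalf("creating quota in %s: %v", ns, err)
		}
	}
}
```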
Anything other than that, let's say there's an issue with a particular resource quota or limit range within a user namespace, that's where the cluster admin comes into the picture. Other than that, we have another role called namespace admin. If I'm the owner of a namespace group, I have a collection of namespaces and admin access to them, which means I can do whatever I want within them. For example, let's say I'm trying to install some sort of CRD-based operator. There's an automated process where you can go and submit a request, a particular add-on within the platform will create the custom resource definition for you, and from then on you're on your own. So that's how we've done it. Coming back to the monitoring strategy: it's actually a mix, a combination of Datadog, Splunk, and so on, and we have a very good collection of pre-built dashboards. At any point in time, with 300 clusters, look at it this way: each cluster has this collection of namespaces that I said is the platform. So within these 300 clusters, if the platform is unstable on any of them, we'll get to know. That's how we've set it up. From our standpoint, we just look at a particular platform version: when we release platform version 1.0, we know which clusters are upgraded and which are not, and at any point we'll know if platform 1.0 has issues in any of the clusters. So we use Datadog in combination with Splunk, and we have all these pre-built dashboards. We use metrics heavily, metrics and logs. Traces, I think some of the community projects have and some don't; even our internal tools that we've developed are, I think, still in the process of making the best use of traces. We're getting there. In terms of monitoring, again, it's separated: anything platform-related comes to us, anything application-related goes to the namespace admin, and anything beyond that goes to the cluster admin. That's how we've separated it. So if, let's say, an application team has an issue with their deployment, it doesn't come to us at this point. I just wanted to touch a little bit on something we're actually working on, because maybe it's useful for viewers to think along these lines. Take this problem: you have a deployment pipeline, and at this point, if a deployment is having an issue, we have an SRE team and it comes to us sometimes, but most of the time, if I'm a developer, a mid-level developer with four or five years of experience, I usually go to the team lead first, and the team lead goes to the business unit DevOps team, right? So what we're trying to do now, as part of the platform, is come up with another kind of system. Imagine this, this is what we're trying: you have a Jenkins pipeline, for example. Imagine we give you a Jenkins plugin where, anytime your Jenkins build fails, it prints out a link; you click on the link and it tells you what the problem is. That's what we're working toward. Hopefully we'll actually open source it.
We're trying to build some machine learning models and do some analysis on top of them to come up with these things. The reason I mention that is that now we're trying to take it a step further, so that for each developer who's a user of the platform, we focus on the pain points they have and try to solve them. I just wanted to touch on that, but going back to monitoring again, Dave, do you want to add something? Sure, yeah. One of the questions on the board was: how do you monitor worker health? We do have some basic checks for workers that just ensure the virtual machine exists and is responsive; that's really just basic monitoring. But the real monitoring comes from Datadog. The Datadog monitors we've set up look specifically at the components within the clusters. So for instance, if your kubelet is down, well, then your node is not going to work, right? From that perspective, the health of that node is inoperable; it's not functioning. That's how we prescribe to it, so a lot of the monitors we've written are really around those service component health statuses. From a logging perspective, as Rajan mentioned briefly, we use Splunk. We have kind of a mixed bag of logging: we use Datadog in some areas, we use Splunk, and we also have a team internally that has built out some really interesting architecture around an aggregation tier based on Fluentd. Basically, our Fluent Bit log collectors pull the logs off the nodes and push them to a Fluentd aggregation tier, that aggregation tier pushes them on to Kafka topics, and those Kafka topics are read by the ELK stack. That's how we're able to use Kafka almost like a traffic manager: where do I send logs for these specific clusters? Because there are different requirements from the business lines around where they want their logs to land. So yeah, I hope that answers some of those questions. Yeah, and I just wanted to add something on the question of how you monitor worker health. The Datadog and Splunk agents we have cover this by default, but on top of that, we have deployed, as part of the platform, one of the add-ons, if I'm not wrong it's the node-problem-detector, which I think is part of the Kubernetes project itself, and that actually helps as well. But I just want to mention one problem that we still have. Look at it this way: let's say there's an application deployment that failed. If you look at the logs, it'll say the Helm release timed out. If you run kubectl get pods, it'll say the pods are Pending. Then you figure out why the pods are pending; then you see your nodes are unstable; then you figure out why the nodes are unstable, and it turns out to be something to do with the cluster autoscaler, or something happening on the network side that is affecting the AWS Auto Scaling groups. There's a chain of things. So one of the pain points developers have today is that when something like this happens, even with the current solutions, even if you set alerts, when you open your mailbox you have a flood of alerts. It's not as if someone tells you: hey, out of all these problems, forget the rest, just fix this one; this is what you need to focus on, and everything else will resolve itself.
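[Editor's note: the failure chain Rajan walks through (Helm timeout → Pending pods → unhealthy nodes → autoscaler) can be partially automated. The minimal sketch below uses client-go to surface why pods are Pending, which is one link in that chain; the namespace and the depth of the diagnosis are illustrative only.]

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Find Pending pods in a (hypothetical) application namespace.
	pods, err := client.CoreV1().Pods("payments-dev").List(context.Background(),
		metav1.ListOptions{FieldSelector: "status.phase=Pending"})
	if err != nil {
		log.Fatal(err)
	}

	// The PodScheduled condition usually carries the scheduler's reason,
	// e.g. "0/5 nodes are available: insufficient cpu".
	for _, pod := range pods.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled && cond.Status != corev1.ConditionTrue {
				fmt.Printf("%s pending: %s: %s\n", pod.Name, cond.Reason, cond.Message)
			}
		}
	}
}
```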
That's the problem the project I mentioned earlier, based on AI and machine learning, is trying to solve. When a deployment fails and you click on the link, we want to tell you: hey, there's a network issue happening, your Auto Scaling group is having a problem, someone is working on it, don't worry about it, rather than you running through a list of commands. It's a sort of correlation analysis. These are problems we still have, even though we have a pretty sophisticated monitoring setup today. And sometimes, say there's a network outage going on that is affecting a lot of things: for a developer who's just looking at a Jenkins pipeline, it takes hours to get that information. Sometimes they raise a ticket, someone has gone for lunch, they come back, they look at it, they post something in the Teams chat, and somewhere along the way they get to know that networking is working on it. That correlation is the gap. We're the platform team, but the way we look at it, they are users of our platform and this is the platform experience, so we want to enhance that. Alongside the existing monitoring, we're investing effort in how we can use the latest ML techniques to make this better for them. And there was a question: how does one differentiate between logging and tracing? As I said, most of the add-ons, I'm not sure they do a lot of tracing. At this point, most of our monitoring starts with metrics, and from metrics we try to correlate to logging. We've seen that when you have tracing, that's the best thing you can have: you start with metrics, that's where you get the alerts, then you go to the trace, and then you get to the logs. But at this point, everything starts with metrics and then goes to the logging. Some of the newest stuff we're trying to do based on machine learning is actually the reverse: it starts with the logs. It's interesting; it's for the future, but that's one thing we've been doing. And there was a question about which component actually reaches out to the kubelet API. I didn't fully get that, but basically, in terms of the communication between the control plane and the kubelet, that differs; for example, EKS is slightly different from Rancher. How do you do inter-cluster networking? That's a very good question, and I'll tell you, we still have that as a pain point. It's not a problem we've solved. We're working on it; there are solutions we're still looking at, but I can say it's a problem we have not solved. Basically, we're still using external load balancing services to handle inter-cluster traffic. We haven't come up with any kind of solution like a service mesh, nothing like a global service mesh or anything like that. It still goes out, comes through the load balancer, and then comes back in. The Envoy stuff that I mentioned earlier, that's about things getting built on top of our platform.
We have FID EKS, right? And there's, for example, an ML platform that a team is trying to build on top of FID EKS; similarly, there's an API gateway being built on top of FID EKS. The platform we've built over a period of, when I look back, two years is now a solid foundation that people can actually build on top of. Internal clients, one of the business units, can take the layers concept I talked about earlier, which is basically a collection of add-ons, and contribute and say: hey, I have this set of machine learning features available, I'm packaging it as a layer, apply it on top of your platform and that becomes your ML platform. And at the same time, it's an ML platform with all the Fidelity constraints baked into it. So the Envoy stuff I mentioned was along those lines: there's an API gateway being built on top of our platform, and that's where Envoy is actually used. We're just about out of time, but yeah, if we can get through the last question or two. Yeah, honestly, even if this were two hours, I think we'd have a lot of stuff to talk about. Looking at cluster-to-cluster, pod-to-pod: we do have teams using Calico within their clusters. On the cloud side at least, the on-prem answer may differ, we stick to the native CNI drivers; we don't use overlay networking, at least at this point. And we have teams, and this is something we don't enforce, that can install Calico on top of our platform, it's not part of the platform yet, and then do their network policies with that. Yeah, we're using Canal on-prem with overlay networking; in the public cloud, they do not use overlay. And cluster-to-cluster, it's still something that goes out and then comes back through the ingress; we don't have any global service mesh or anything like that at this point. All right, that makes sense. I guess with that, we can wrap up. I just want to thank everyone for joining us today for this episode of the Cloud Native End User Lounge. It was great to have both of you, Rajan and Dave, on to talk about Fidelity, and we had some great interaction and great questions from the audience. We bring this End User Lounge to you on the fourth Thursday of every month at about 9 a.m. Pacific Time, so hope to see you at the next one. And don't forget to join us for KubeCon + CloudNativeCon North America, October 12th through 15th, to hear the latest from the cloud-native community. Also, if you'd like to showcase your usage of cloud-native tools as an end user, join the End User Community; there are a lot more details at cncf.io/enduser. Again, thanks everyone for joining us today, and we hope to see you next time. Thanks for having us. Bye. Bye.