Hi, and welcome to this session at KubeCon. I'm Cheryl Hung, and I'm going to be the host for today's panel discussion. Today we're going to be talking with members of the CNCF end user community about their experience running GitOps in the enterprise. The CNCF end user community is a group of companies who are adopting cloud native and Kubernetes, and who get together to share their experiences and figure out what challenges they can overcome together. I'm extremely honored to have three members of the end user community joining me here today, and I'd like them each to introduce themselves. I can start: I'm Cheryl, VP of Ecosystem at the Cloud Native Computing Foundation. Matt, please introduce yourself.

Hello, I'm Matt Young. I'm an architect at EverQuote on our cloud engineering team.

Hi, I'm Amr Abdelhalem. I head the cloud platform team at Fidelity Investments. I'm also a CNCF board member.

Hi, my name is Fabio Giannetti. I'm leading the internal cloud group at Mastercard.

Fantastic. Thank you all so much for joining today. We've got a couple of questions that we're going to start off with, and we're just going to talk amongst ourselves and learn how you've approached them at each of your companies. So, first thing: what is GitOps? Let's define GitOps and why your organization chose to adopt it. Fabio, let's start with you.

Yes, thank you. For us, GitOps has really been a journey to represent our infrastructure through code, as well as what we call the core applications, the applications we deploy on our Kubernetes clusters to run the infrastructure itself. The reason we chose this is that it allows us to represent everything we have in our internal cloud as a Git commit in a Git repository. And why did Mastercard choose to adopt it? Because it simplified our operations and gave us a good handle on the different teams and the different environments where we deploy our applications. It became very easy to track everything that we run in our cloud environment.

Amr, is that the same for you at Fidelity?

Yeah, similar to Fabio and Mastercard. At Fidelity we're a platform team that's responsible for building the platform of platforms, so we enable, govern, and regulate hundreds of clusters and platforms, varied ones, across different business units, with different capabilities and features as well. GitOps was our way of scaling that platform and building full automation around it. Right now we're running over 400 clusters using that method.

So over 500 clusters, you said?

Four hundred-something clusters.
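To ground what "everything as a Git commit" means mechanically, here is a minimal, illustrative sketch of the loop at the heart of GitOps: desired state lives in Git and is continuously applied to the cluster. This is not any panelist's production tooling; the paths are hypothetical, and real agents such as Flux or Argo CD add drift detection, health checks, and alerting on top of this basic idea.

```python
#!/usr/bin/env python3
"""Minimal GitOps reconcile loop (illustration only). Assumes `git`
and `kubectl` are on PATH and the repo holds plain Kubernetes
manifests; the repo path and manifest directory are hypothetical."""
import subprocess
import time

REPO_DIR = "/var/lib/gitops/infra-repo"   # hypothetical local clone
MANIFEST_DIR = "clusters/prod"            # hypothetical path in the repo
SYNC_INTERVAL_SECONDS = 60

def sync_once() -> None:
    # Pull the latest desired state: Git is the single source of truth.
    subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)
    # `kubectl apply` is idempotent, so re-running it converges the
    # cluster toward whatever is committed rather than redoing work.
    subprocess.run(
        ["kubectl", "apply", "--recursive", "-f", f"{REPO_DIR}/{MANIFEST_DIR}"],
        check=True,
    )

if __name__ == "__main__":
    while True:
        try:
            sync_once()
        except subprocess.CalledProcessError as err:
            # A real agent would surface this via metrics and alerts.
            print(f"reconcile failed: {err}")
        time.sleep(SYNC_INTERVAL_SECONDS)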
Cool. And what about you, Matt? I know you run at a bit of a different scale.

Sure. EverQuote is a fairly young company, and we've been growing rapidly over the last few years. Our cloud engineering team is a fairly small team embedded within the organization; our customers are engineering teams, and they all like to move fast. So we've spent the last year rebuilding our infrastructure to position us to support all of this horizontal scaling, if you will, of the business. To do that with a team of our size, we've really leveraged GitOps in two contexts. The first is our core infrastructure, as others have said, and for us it's really about toil reduction and disaster recovery, but perhaps most importantly, self-service mechanisms. All of our infrastructure is described in a variety of ways, but it's all in Git, and we partner with a lot of our teams: if they need to do things faster than we can, PRs are welcome. So we've been quite successful over the last few quarters employing a lot of open source methodologies and workflows for how we manage our infrastructure. Additionally, and I'll be brief, the second context is that our engineering teams are also beginning to adopt some of these workflows. Although their scenarios are somewhat different, a lot of the same benefits have been realized.

What do you mean by some of their workflows being a little different? Can you elaborate?

Well, sure. Like many companies in our position that have gone public in the last few years, or that are rapidly growing, we have a variety of systems. Some are very legacy and were very much made by hand. As the organization has matured we've gotten more automation in place and more automated deployments, but we still have a good number of systems that are not today described declaratively in Git; there's an operational or procedural aspect to them. So we're somewhat polyglot in our CI/CD mechanisms. For example, we have Bamboo as well as Flux CD; we'll get into that later, I think. We definitely have some teams building new things, and generally those follow a GitOps methodology, but we have to be realistic that it's not the hammer for every nail, or however that saying goes.

Amr, did you have something that you wanted to add to that?

Not really, just one last comment: GitOps was a very useful mechanism for enabling our platform in a multi-cloud model. We have different cloud providers and different infrastructure besides our private cloud as well, and the layers underneath Kubernetes itself differ between them: different network models, different virtualization models, different mechanisms under the hood. GitOps gave us the abstracted layer that bound all of these components and all of these cloud platforms together.

All right, thank you. Let's go to the next question, and this one almost jumps straight to the end: what lessons did you learn? I'd like you to talk about the scale of what you're doing, maybe the journey that you've taken to get here, and what you would recommend for other people. Amr, why don't you go ahead?

Sure. I would say lesson number one was understanding the complexity and the whole big picture: where GitOps is going to be used, how to implement it, and how to introduce it to the development teams while building that GitOps model. That was one of the challenges, one of the lessons learned. The second one was mainly cultural. Our team is more of an engineering team that hands that automation over to all the DevOps communities within our business units, and we have many of them in the company. So we had to enable that culture and help them understand how it works and how it differs from the traditional operational model. The declarative aspect of Kubernetes, and the declarative aspect of managing your infrastructure, was unique and new for them.
The result was this aha moment. At first it was: how do we do that? Where is my process? Where is my validation process? But after a while they got used to the idea. It took a while.

Yeah, and I can go next. For us, one of the very interesting lessons learned has been around the audit trail. We use GitOps to drive upgrades, deployments, and changes across all the environments, including our production environments, but the company has built its audit trail on traditional ticketing systems. So one of the issues we were facing is that those GitOps operations need to be mirrored into the audit trail. We've spent a significant amount of time building some automation: when you make a commit and create a pull request, we open a ticket automatically, and then the act of approving the ticket automatically gives a plus one on the pull request. We made the plus one mandatory, so the pull request cannot be merged unless the plus one exists (a sketch of this flow appears after this exchange). This has been a game changer for us, because it allows us to stay compliant, PCI compliant, and it lets us work very well with the rest of the company without forcing us to change the process the company follows for audit.

Yeah, that's a great point, and we have similar concerns. We partner with a number of insurance companies and other entities, so we have a lot of regulatory and compliance concerns. One thing about a GitOps methodology is that access to Git, and the role-based access control around it, is really important to think through, because now Git is your control surface. There are different ways to address that, and we're employing similar approaches with automation around webhooks and PRs and things like that. We've had some other lessons as well in the last year. At the top level, I would say UX really matters; developer experience and usability matter. For example, we're using Flux CD for most of our infrastructure that is in Kubernetes, and it's working quite well, but its primary user is our cloud engineering team, who know that full stack. For a new developer or a new development team that's just getting started, the UX of that is really "go look at the logs." There are lots of different tools you can bring to bear, but think up front about who's going to use them, what interactions they need, and what workflows have to be supported when choosing tools. I'll also say that, particularly for Kubernetes workloads and Kubernetes itself, one thing GitOps will quickly shine a light on when you scale out horizontally is how you're dealing with configuration management. Between Helm, Kustomize, and Tanka and/or Jsonnet, there are a variety of approaches to avoid ending up with a bunch of copy-and-pasted YAML everywhere. It's worth investing some time in deciding how your organization wants to manage configuration and how you want to abstract out the differences; the rabbit hole runs deep and there's a lot there. We, in fact, use all three of those tools. It's worth bearing in mind up front. I'll also say: think about workflows around branch management, promotion of code between environments, how you're running things. When you move to GitOps, if you have gaps there, or if you have varying points of view in your organization, you're going to find out quickly.
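The PR-to-ticket bridge Fabio describes can be pictured as a small webhook service: a new pull request opens a change ticket, and approving that ticket posts the mandatory plus one back on the PR. The sketch below is illustrative only; every URL, field name, and event type is a hypothetical placeholder, not Mastercard's actual system.

```python
"""Sketch of a PR-to-ticket audit bridge (hypothetical endpoints)."""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests  # third-party: pip install requests

TICKET_API = "https://tickets.example.internal/api/changes"  # hypothetical
SCM_API = "https://git.example.internal/api/pulls"           # hypothetical
AUTH = {"Authorization": "Bearer <service-account-token>"}   # placeholder

class AuditBridge(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        body = self.rfile.read(int(self.headers["Content-Length"]))
        event = json.loads(body)
        if event.get("type") == "pull_request_opened":
            # Mirror the PR into the existing ticketing-based audit trail.
            requests.post(TICKET_API, headers=AUTH, json={
                "summary": event["title"],
                "reference": event["pr_url"],
            })
        elif event.get("type") == "ticket_approved":
            # Ticket approval grants the mandatory +1; branch protection
            # blocks the merge until this review exists.
            requests.post(f"{SCM_API}/{event['pr_id']}/reviews",
                          headers=AUTH, json={"vote": "+1"})
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AuditBridge).serve_forever()
```

The key property is that neither side's process changes: auditors keep their tickets, engineers keep their pull requests, and the automation keeps the two in lockstep.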
So again, work out up front how you as an organization want to do that, and whether you want to support different methodologies. That's all stuff you really want to think through prior to diving in with both feet.

Yeah, absolutely. I've heard the same from quite a few different companies. Matt, you already mentioned this a little before, but let's talk about the tooling and the projects you're using in your companies to implement GitOps. Amr, can I start with you?

Yeah, sure. Like Matt mentioned, we're actually using several frameworks around GitOps: Flux is one of them, Helm is the second. In addition, we had to start using other projects outside of Kubernetes itself to complete the picture, completing that ecosystem. For example, Fluent Bit was required for us; it was part of the ecosystem we had to build and use. Prometheus was another component, along with Grafana, because you need that reporting piece and that observability front for your cluster management as well.

Yeah, and from my side, we value the CNCF landscape a lot. It brings all the technologies together and gives you the ability to understand their level of maturity. Personally, we use Kustomize, and we use Helm, but we don't deploy with Helm directly: we use the templating capability of Helm to render the charts, and then we layer Kustomize on top (see the sketch below). We also experimented with Argo CD; we like the declarative versus imperative approach to deployments. One of the things we would like to see more of on the landscape is going into the details of Helm charts and operators, which we use quite extensively. Helm is a graduated project, but within Helm, many charts are in very different states, and we would like to see those mapped out more, or owned by the projects themselves, so they become more stable and reliable. The same applies to operators: we've had a lot of hit and miss with the maturity, support, and quality of the operators that are out there.

Would you agree with that, Matt? You said you use Helm as well.

Absolutely. So again, we started our journey with GitOps at the beginning of this year, and we have some of the same challenges around Helm charts. Helm moving to its current version and getting rid of Tiller certainly simplified things, in that you no longer have a top-level orchestrator, so to speak, in Tiller. We're taking a similar approach today, using Kustomize combined with Helm to pre-render templates so that it's very obvious what's happening. And we're actually beginning to explore Tanka and Jsonnet, as they've recently added the capability to import Helm charts into a Jsonnet project. But we've certainly evaluated a bunch of other things as well: everything from Spinnaker to Argo, GitHub Actions, Bamboo, Flux CD. There's a lot out there, and that just scratches the surface of what's in the community.
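The pre-rendering approach both Fabio and Matt describe can be sketched as a two-step build: expand the chart with `helm template`, then apply environment overlays with `kustomize build`. The chart, release, and path names below are illustrative, and the sketch assumes the overlay's kustomization.yaml references the generated base.

```python
"""Helm-then-Kustomize pre-render sketch; all paths are illustrative."""
import subprocess
from pathlib import Path

def prerender(chart: str, release: str, values: str, overlay: str, out: str) -> None:
    # 1. Expand the Helm chart into static YAML. Nothing touches the
    #    cluster here, so reviewers see exactly what would be applied.
    rendered = subprocess.run(
        ["helm", "template", release, chart, "--values", values],
        check=True, capture_output=True, text=True,
    ).stdout
    base = Path("build/base")
    base.mkdir(parents=True, exist_ok=True)
    (base / "all.yaml").write_text(rendered)
    (base / "kustomization.yaml").write_text("resources:\n  - all.yaml\n")
    # 2. Layer the environment overlay on top. The overlay's own
    #    kustomization.yaml (checked into the repo) is assumed to point
    #    at build/base as its base.
    final = subprocess.run(
        ["kustomize", "build", overlay],
        check=True, capture_output=True, text=True,
    ).stdout
    Path(out).write_text(final)

if __name__ == "__main__":
    prerender("charts/my-service", "my-service", "values/prod.yaml",
              "overlays/prod", "prod-manifests.yaml")
```

Committing the rendered output to Git is what makes the approach pay off: every chart upgrade shows up as a reviewable diff of concrete manifests rather than a version bump hidden inside a templating engine.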
For us, Flux CD has been very good, particularly because we have quite locked-down environments and a lot of security concerns. Flux in particular is effectively a bot sitting inside the cluster reaching out, versus something like GitHub Actions or more traditional tools where, from outside the cluster, you authenticate over the internet, or from somewhere else, to the API server of a Kubernetes cluster. We don't have to have credentials, the keys to the kingdom, stored in third-party systems if you're using a SaaS variant, or in other internal systems: it's a pull-based model. Also, consistent with what the others have said, to make all of this a reality we've adopted a largely CNCF-based stack, which includes Prometheus and Grafana. When evaluating what we should use for logging, for example, we chose the Banzai Cloud logging operator because it's declarative in nature. Once you start down this GitOps path, and you start seeing the benefit of "oh, I want to change something, I'll make a quick PR that bumps a version number, and then done, right? No toil, no drama," you start taking a really hard look at systems and other building blocks of your infrastructure that are not described declaratively. That's been one big mind shift in our organization: it ends up as a filter you almost apply. I have this wonderful thing in place that I don't have to deal with because it's automated; I don't want to add something that has a manual step to it, or that is not declarative and idempotent and all of those sorts of things.

Yeah, and I want to build on the point Matt was making about giving away the keys to the kingdom. One of the approaches we use is to create what we call a control cluster: the cluster where the deployer sits. Our Argo CD sits in that control cluster, which only the platform team has access to. So when an application team makes a pull request to deploy something, in reality the tool that deploys sits on our control cluster, and that's locked down; we are the only ones with access to it. That's an extra layer of security we put in place, and our security team liked it because it reduces, or limits, the attack surface.

And in our case, we actually built our own framework. It was released to the open source community a couple of weeks ago; it's called Kraan. It's a management layer for all the operators that we run and govern today within the Fidelity ecosystem, and it's based on GitOps as well; it's built on Flux. That framework allows the system admins, the people administering Kubernetes inside Fidelity, to state which version of which controllers and operators is used for each release, and to manage and push that through GitOps. It's pretty cool, actually, because it lets us extend the number of operators, and it also sorts them into multiple levels: production-grade operators versus experimental operators that have only reached a certain level of maturity and can only be used for development purposes, and so on. In addition, it has version management control around all of that. And it follows the same approach: with GitOps you can manage this automation, this complexity, across 400-plus clusters.

Yeah, and I think there are challenges as organizations scale. I mean, if you're a small startup and everybody can fit in the same room, you can just pick a stack or pick a tool and all do it together.
In our case, however, we're only six or seven years old, but we have existing processes. For example, some teams might use Bamboo or some other system for Sarbanes-Oxley compliance, so that we have an audit trail of changes to the code as well as the infrastructure. But then, on the other hand, we might want to use Flux. So I would encourage folks looking at this not to be too absolutist, and to keep focused on what problems you're trying to solve. For example, you could have an existing ticketing system, or a workflow system like Bamboo or Jenkins or whatever, that is used for compliance and provides that audit and control workflow. It could just merge a branch, and then something like Flux could wake up, pick it up, and do the deployment. When you put these systems together and everything is moving in parallel, you can make a lot of work for yourself that you didn't expect if you say, ah, well, since this is GitOps, now GitOps has to be our control surface as well, so we need to get rid of things that work just fine and move everything to Kubernetes, RBAC, teams, and all of that. You can put these things together in a lot of different ways, some better than others, but keep an open mind and be careful of rabbit holes, because some of them can go quite deep. You can find yourself doing a bunch of work that you hadn't intended to do, or that you don't actually need to do, if you're too set on having just one tool.

Yeah, that makes sense.

For what it's worth, we're loving Flux, and we're using it for almost all of our infrastructure.

Okay, awesome. Let's go to the next question. The next one is: how has the lifecycle management of Kubernetes been impacted by GitOps?

Let me start with that. I'm going to mention two examples where GitOps had a big impact on our Kubernetes lifecycle management. The first one is cluster management, because, similar to Fabio and Matt, we have a very highly regulated environment. We actually have a requirement to do rehydration, upgrades, and refreshes for all of our clusters on a monthly basis, which is a challenge, because you basically have to treat your Kubernetes clusters themselves as cattle, not pets. They have to be rebuilt with the same consistency, they have to be up and running the whole time, and it has to be done in a canary model. On top of that is the complexity of having all of our operators deployed at the same versions, in the same manner and patterns, across all of these clusters. That got simplified through GitOps, because in our Git there is a tree for each one of those clusters, holding the specs for that cluster and what it should look like; a sketch of this idea follows below. When the deployments, upgrades, and rehydration happen, they take advantage of that: I want to move from one minor version of Kubernetes to the next, or a major version, or I want to patch my AMI and update the images I'm using for my nodes, and so on. It became very controlled management; it was very decent, actually.
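A minimal sketch of the per-cluster spec tree Amr describes: one file per cluster, committed to Git, declares what the cluster should look like, and automation diffs that against the live fleet. The layout and field names are hypothetical; a real implementation would also encode the canary ordering and blast-radius rules he mentions.

```python
"""Sketch of a per-cluster spec tree driving fleet upgrades."""
from pathlib import Path

import yaml  # third-party: pip install pyyaml

SPEC_ROOT = Path("clusters")  # e.g. clusters/prod-us-east-1.yaml (hypothetical)

def load_specs() -> dict:
    """One YAML file per cluster in Git is the desired state of the fleet."""
    return {p.stem: yaml.safe_load(p.read_text())
            for p in SPEC_ROOT.glob("*.yaml")}

def plan_upgrades(live_versions: dict) -> list:
    """List clusters whose declared Kubernetes version differs from live."""
    actions = []
    for name, spec in sorted(load_specs().items()):
        want = spec["kubernetesVersion"]      # e.g. "1.19" (hypothetical key)
        have = live_versions.get(name, "unknown")
        if want != have:
            actions.append(f"{name}: {have} -> {want} "
                           f"(type={spec.get('type', 'default')})")
    return actions

if __name__ == "__main__":
    # Live versions would come from the cloud provider or cluster API;
    # hard-coded here purely for illustration.
    print("\n".join(plan_upgrades({"prod-us-east-1": "1.18",
                                   "dev-sandbox": "1.19"})))
```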
The second use case I would call out that impacted our clusters, because they're large, is multi-tenancy. Using a GitOps model as well, we built a multi-tenancy operator, and that operator guarantees the same application onboarding everywhere. For instance, using GitOps in this case you can define your application, connect it to our ticketing system in the back end, and define which team is responsible for it and what roles they have. The operator takes that and creates the namespaces, connects them to the whole Fidelity ecosystem in the back end, sets up each namespace, and configures the routing: the ingress management and routing rules and firewalls, all of those aspects (a simplified sketch of this onboarding flow appears after this exchange). So at the end of the day, the development team gets a ready environment within that cluster for their development effort. Again, GitOps was used in this case and it was actually very successful: some of these clusters reached something like 270 namespaces, with applications running in a shared environment. I don't think there was another method by which we could have achieved this kind of automation without GitOps.

Yeah, and I'll echo that the biggest benefit for us is the automation we can drive with this, right? When there's a minor release of Kubernetes, we can basically drive the automation from our sandbox environment. If that passes the tests, then we go up to dev, and we can do selective upgrades of different clusters. We have a test cluster in every environment, plus our control cluster, that we can go and upgrade, and then we move on to the other clusters. And we have a hierarchy of clusters depending on the impact the applications on them may have, so we proceed in a certain order. The other thing we do is that when teams create a cluster, they have the ability to choose a type. A type could be single tenant; it could be single tenant with PCI compliance on it; it could be multi-tenant; or it could be a cluster used for specific things that may be created and destroyed after a while. All of that is expressed in Git, and that triggers the automation to build all of it, right? Before, all of this was done through tickets, as with our VM environment, for instance, much more manually than what we're able to do now. So it definitely allows us to see what happened, and to automate the majority of the operations while staying compliant.
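A heavily simplified sketch of the tenant-onboarding flow Amr describes: given an application declared in Git, create its namespace and bind the owning team to it. The real operator also wires up ingress, routing, firewalls, and ticketing integration; the names below are hypothetical, and this is not Fidelity's actual code.

```python
"""Tenant-onboarding sketch using the Kubernetes Python client."""
from kubernetes import client, config  # third-party: pip install kubernetes

def onboard_tenant(app: str, team_group: str) -> None:
    # An in-cluster operator would use config.load_incluster_config().
    config.load_kube_config()
    core = client.CoreV1Api()
    rbac = client.RbacAuthorizationV1Api()
    # 1. One namespace per application, labelled for ownership/chargeback.
    core.create_namespace(body={
        "apiVersion": "v1", "kind": "Namespace",
        "metadata": {"name": app, "labels": {"team": team_group}},
    })
    # 2. Bind the owning team to the built-in `edit` ClusterRole, scoped
    #    to its own namespace only, so tenants cannot touch each other.
    rbac.create_namespaced_role_binding(namespace=app, body={
        "apiVersion": "rbac.authorization.k8s.io/v1", "kind": "RoleBinding",
        "metadata": {"name": f"{app}-editors", "namespace": app},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "ClusterRole", "name": "edit"},
        "subjects": [{"kind": "Group", "name": team_group,
                      "apiGroup": "rbac.authorization.k8s.io"}],
    })

if __name__ == "__main__":
    onboard_tenant("payments-api", "team-payments")  # hypothetical names
```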
Yeah, I mean, we're operating at a slightly lower scale, by an order of magnitude at least, in total number of clusters or total number of developers, but, and again this is sort of a lesson learned, there are a lot of edge cases when managing Kubernetes clusters generally. We run across multiple clouds, we're running both GKE and EKS in production, and we have a multi-account setup and all of that. There can be, for example, race conditions when deploying various operators. There are things that should just work but in reality sometimes still need human intervention, and depending on the size of your team and your ability to completely automate everything, you might have to make some decisions, as we did: automate the things that are readily automatable, but take care with the others. Going back to the Helm bits, another small, concrete example: Helm has a plugin model, so you might use a Helm chart that actually executes things when doing an upgrade, to mutate or migrate state. But if you're taking a templating approach and not actually running Helm that way, you might inadvertently miss that some upgrade to a Helm chart uses a plugin, and that's not being captured by your workflow. So exercise caution, and just don't assume that because everything is automated, everything is working. Investing in things like Prometheus and alerting, and in how you handle things when they're not going well, is really critical, because particularly with Kubernetes, as anyone who's used it knows, there's a lot going on, and it's very easy, if you're not conscientiously and intentionally alerting on things, to not realize that something is not working so well. And if you have a small team, you might not notice.

One thing we're doing now when upgrading the sandbox is validating that the upgrade happened successfully. We run Sonobuoy and kube-bench to check that the end result of the upgrade is compliant with the CIS benchmarks, and then we run some synthetic tests to verify that you can deploy an application, that it comes up, and all that kind of stuff. That's been a way for us to enable some automation around upgrades, especially minor upgrades more than major ones.

Yeah, Sonobuoy is yet another gift from Heptio to the world.

For sure. That's similar to what Fabio described, actually; we have a similar process. We have our own validation process that runs after most of these operations, and we built a full framework, using Cucumber and other tools, to do validation for each one of these operators. Some of them are very critical, like our secrets management integration piece, for example; that's a requirement, a must, and those pieces need the validation. So after each of the upgrade processes and so on, we have to run a full validation process, and then all the synthetic monitoring after that; a sketch of this kind of post-upgrade gate follows below. That would be very similar to what you do at Mastercard.

Yeah, actually the experience of the last year has had me taking a hard look at what we do next year: should we upgrade at all, or should we just install new clusters at the new target version and use DNS load balancing or canaries? We're also running Linkerd in production across all clusters, so there are multi-cluster routing capabilities that weren't there a year or two ago that also pave the way toward just not dealing with upgrades at all. I mean, if you've ever worked on a sizeable project, upgrades are always hard. They're always hard, they're complicated, and they're easy to get wrong even with diligence. So we're exploring whether it would be simpler to just move to a fully disposable cluster model. In reality, I think we're not there, given the size of our team. And if you're using partially managed Kubernetes offerings like EKS or GKE, they have their own upgrade workflows, and provisioning a new cluster, a hardened, multi-tenant, scalable cluster, involves a lot that you've put into it to make it that way.

The challenge for us between rebuilding and upgrading is mainly the ecosystem around Kubernetes, not the Kubernetes upgrade itself. The difficulties are around everything else: all the security aspects, the regulation, firewalls, configuration, authentication and authorization back to some of our data center resources, and so on. That's where we see the challenges for rebuilds, given that we rehydrate every month anyway.
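The post-upgrade gate Fabio and Amr describe might look something like the sketch below: run Sonobuoy for a conformance-style smoke test and kube-bench for the CIS checks, and fail the pipeline if either is unhappy. The flags shown are the commonly documented ones, but treat the exact invocations and output parsing as assumptions to verify against your versions of the tools; kube-bench in particular normally runs as a job on the nodes rather than as a local binary.

```python
"""Post-upgrade validation gate sketch using Sonobuoy and kube-bench."""
import subprocess
import sys

def validate_cluster() -> bool:
    # Quick conformance-style smoke test; --wait blocks until it finishes.
    subprocess.run(["sonobuoy", "run", "--mode", "quick", "--wait"], check=True)
    tarball = subprocess.run(["sonobuoy", "retrieve"], check=True,
                             capture_output=True, text=True).stdout.strip()
    results = subprocess.run(["sonobuoy", "results", tarball], check=True,
                             capture_output=True, text=True).stdout
    subprocess.run(["sonobuoy", "delete", "--wait"], check=True)  # clean up
    # Simplification: invoking kube-bench directly on a node; it prints
    # [PASS]/[FAIL]/[WARN] lines against the CIS benchmark.
    bench = subprocess.run(["kube-bench"], capture_output=True, text=True).stdout
    return "Failed: 0" in results and "[FAIL]" not in bench

if __name__ == "__main__":
    # A non-zero exit fails the surrounding upgrade pipeline.
    sys.exit(0 if validate_cluster() else 1)
```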
So I actually have a question, if you guys don't mind; I'm very curious about it. Did any of your development teams experience GitOps on the application deployment side?

Yes, we have. Well, I'll say we're a work in progress, but earlier this year we had a couple of applications move to using Flux for application deployments. I'm enthusiastic about what comes next for Flux with the new version; I know the project is doing some work around observability and making the UX around that a little easier. For some of the teams that have onboarded their first few services onto Flux, there's been a bit of a learning curve. We've also had really good luck with our email remarketing team using Loki to help understand what's happening in Flux. So they've got dashboards: what's happening with the service, and what's happening with Flux. But again, that tool in particular is sort of "watch and look for things in logs"; it doesn't have a great UX that's broadly accessible.

For us it's maybe more complex, because the application teams basically have a CI/CD pipeline around Jenkins that has been defined and used for a very long time, and there's a lot of security hardening that has been built on top of it over the years. So going there and saying, hey guys, replace all of this with something different: eventually it will happen, but I think it's a longer conversation and it's going to take a good amount of time.

We've been talking a lot about infrastructure, where, particularly with Kubernetes, you can very quickly deploy Flux, or rather GitOps mechanisms irrespective of the tool, and very quickly get some wins. But when it comes to applications, again, it'll shine a light on things. Oftentimes the tricky part about moving applications to GitOps is everything around the application. We use Terraform heavily, right? But how do you deal with all of the things outside of the actual service itself that it uses: the databases, the buckets, the queues, all of that? How do you make that work as well? There are approaches to doing that, which we're using, but there are complexities around them. Or, for example, many frameworks will run a schema migration the first time the new version of a service hits an older schema. Well, one of the cool things about GitOps is that you can revert things, and sometimes those migrations don't go backwards. So there are a lot of things where, once you get past the initial "hey, I can deploy my service again," you really have to think through everything that is not declarative, or that is declarative plus something like Terraform or CloudFormation that actually has to be run and maybe can't be rolled back the way a container version can.

Yeah, I did an internal experiment, really just an experiment, where we looked at OAM. I think application manifests that include all the dependencies, connectivity, and network boundaries are going to be extremely important for moving application deployments to GitOps. The Open Application Model is a new format that came out and is still evolving, but it's a very good starting point if you have a look at it.

Yeah, I think around January I first saw the initial spec for CNAB. We're not using it yet, but I've been following the project over the year and I'm quite excited by the potential there.
Again, CNAB is the Cloud Native Application Bundle. I'm not involved with that project and I don't have my hands dirty from playing with it yet, but things like that, which encompass not just the application but all of the surrounding infrastructure and tooling, packaged together in a versioned bundle: I'm quite excited about that in particular. A lot of our folks use VS Code, and I find some of the related integrations and tools around CNAB promising as well, now that they've matured over the year.

We're pretty much running out of time, so I'm going to have to go on to the next question; sorry to interrupt the discussion. I'll just give this one to Amr to answer. What does membership in the CNCF end user community mean for your teams?

It means a lot. Number one, I think for Fidelity in general it reshaped our multi-cloud strategy. Right now we have a very unified strategy across multiple public cloud providers, and even in our private cloud, using Kubernetes and container technology in general, which gives us portability between the cloud providers. It united the company in one direction around how to manage workloads across multiple clouds. It also added a lot of value for our teams: going through the CNCF ecosystem to explore all the projects that are out there, understanding what's coming, where the community is, and where to invest. And it shaped our working model. In the past few months we started releasing our own open source projects: we released kconnect, a tool for connecting to multiple clusters across multiple clouds with different authorization models, and the second one is Kraan, which I highlighted earlier, around how to deploy and govern operators. So it reshaped our development model itself: how we develop and build all these management tools for our environments. I truly appreciate the membership; I think it's great.

Fantastic. And I'm really happy to hear that your teams are getting more involved with the community and with open source, because that's how all of this moves forward together. We're pretty much out of time. I really want to say thank you to Matt and Amr and Fabio for giving your time and talking about the challenges and the experiences you've faced. If you want to come and join the end user community and meet other people like this, then please go to the link at the bottom. And that is it from us. Thank you so much. Thank you. Bye bye.