All right, welcome to our session on a day in the life of the GitOps platform team. I'm May. As introduced earlier at the panel, I own and drive the GitOps topic at State Farm. And I'm Priyanka Ravi, I also go by Pinky; I'm a software developer on May's team, the GitOps platform team.

All right, so every organizational change, especially the ones brought forward by automation, starts with an idea, and that idea is really an effort to make things better or to solve a problem. GitOps is one of the many results of us as an organization deciding to focus on developer experience: how do we enable our 7,000 developers to get innovation into the hands of our customers, our customers being either our agents or the digital customers accessing statefarm.com?

We first heard about GitOps back in Dallas, where we have a hub presence, as we were figuring out how to efficiently deploy to our on-prem Kubernetes cluster. As soon as we understood the gravity of it and the benefits we were going to get out of GitOps, we knew we needed to go to work. And by "we," I'm talking about the three-person GitOps platform team that started this all: myself, Russ, our engineer, who's probably watching, and Pinky here. We were the OGs of the GitOps platform team.

So way before GitOps was generally available at State Farm, we spent cycles making sure everybody got excited about GitOps. Yes, we would still talk about the current deployment and release process, but we would spend a significant amount of time getting everyone excited about GitOps, to complete that GitOps pitch and make this organizational change happen.

Out of that pitch there were really different reactions. One, from our compliance folks, was: hey, prove to me that in the case of an audit we can fulfill requirements one through five. The second reaction was from our first-line leaders, the ones signing off on changes affecting their product. That was a mix of "hey, I'm not very familiar with Git workflows," which was brought up earlier in the panel discussion, and, on the other hand, managers who were really excited about the transparency: oh, I can actually see the files changed, the actual lines of code changed to realize this feature. That empowers them and lets them better review the evidence of testing that goes along with it, giving them better confidence that the change is good to go to production. Last and certainly not least were the 7,000 developers, who were really excited about GitOps because they're the ones we're catering to. I remember one piece of feedback from an engineer in Atlanta who said, "If y'all can realize this, if you can make GitOps a thing at State Farm, y'all will be our heroes." So we knew we needed to go to work. With that, I'm going to turn it over to Pinky to talk about what we enabled.

Yeah. So obviously, in order to make GitOps a thing at State Farm, there was a lot of enablement we had to do, starting with: how do we even set up the config repos? How do we make sure they meet the compliance standards that are set forth for us? We landed on using the Terraform GitLab provider, so we could take advantage of infrastructure-as-code benefits: reusability, manageability, and lots of other things. That's one of the big foundational things we had to set up.
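To make that repo setup concrete, here is a purely hypothetical sketch of the kind of settings being codified. State Farm does this declaratively with the Terraform GitLab provider; this sketch uses the python-gitlab client instead, and the host, token handling, namespace ID, and repo name are all assumptions for illustration.

```python
# Hypothetical sketch: provisioning a GitOps config repo with compliance
# settings. The real setup is declared with the Terraform GitLab provider;
# this only illustrates the kind of settings involved, via python-gitlab.
import gitlab

gl = gitlab.Gitlab("https://gitlab.example.com",   # assumed internal GitLab host
                   private_token="REDACTED")       # assumed to come from Vault

# Create the config repo for a product team (name and namespace_id are made up).
project = gl.projects.create({"name": "my-app-config", "namespace_id": 1234})

# Protect master: no direct pushes, merges restricted to maintainers.
# GitLab access-level integers: 0 = no access, 40 = maintainer.
project.protectedbranches.create({
    "name": "master",
    "push_access_level": 0,
    "merge_access_level": 40,
})

# Require at least one approval on merge requests (the first-line-leader sign-off).
approvals = project.approvals.get()
approvals.approvals_before_merge = 1
approvals.save()
```

The value of doing this through Terraform rather than a script is the part Pinky calls out: the desired settings live as code, so they can be reused across many repos and re-applied to detect drift.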
Another thing we had to set up is our onboarding app. It's a UI that our consumers can interface with to get quickly set up with the GitOps process at State Farm. We also have a handful of scripts: some run in one-off situations, and a handful of others run on a scheduled pipeline. One of those we affectionately call the enforcer. It runs nightly, utilizing Terraform Enterprise, to make sure the repos are all still meeting the compliance standards that are set forth. We also utilize Vault as our secret storage. Then we have our cookbook, which is basically our documentation: a place for our consumers to get announcements, find answers to frequently asked questions, and see some of our other enablement like the API and CLI. The way our consumers interface with us directly is through our Rocket Chat channel; that's where they come and ask us questions, and other people can see the questions being asked and get theirs answered as well. And one thing May is going to touch on in a bit is our monitoring and dashboards, which are a nice place for us to see the trends in our data.

Okay, so you can see we have a diagram on this board. We have GitOps enabled for all three platforms listed here: AWS, our on-prem Kubernetes, and Cloud Foundry. This diagram is a very vanilla workflow. In the general process, a developer pushes code to their source code repo. Once it goes into master, it kicks off a GitLab CI pipeline that uses our internally developed GitOps CLI, which creates a merge request into their config repo with just the config files necessary to do the deployment. From there, the deployment mechanism takes over: for AWS it's Terraform Enterprise in some cases, it could be a pipeline, or it could be Flux for Kubernetes. That does the deployment. The other piece is something else we created internally, our GitOps API. It's kicked off by a webhook from GitLab and creates a change record in our asset management repo.

AWS was actually the first platform we enabled at State Farm. We basically ate our own dog food and utilized an app that we have internally deployed, so our consumers could see an exemplar. The next one we enabled was our on-prem Kubernetes; for that one, like I said a second ago, we utilized Flux as the deployment mechanism, and I'll touch on Kubernetes a little more in a bit. Our last one, and the most recently enabled one, thanks to Nimmi over there, is Cloud Foundry. For that one we ended up utilizing pipeline templates to make the process more friendly for our consumers. It utilizes the Cloud Foundry CLI, we had to adjust some of the config repo modules, and we also use short-lived access through UAA just to fulfill deployments.
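As a purely hypothetical sketch of that last hop in the vanilla workflow, the change-record step, a small webhook receiver could listen for GitLab merge request events on a config repo and record an entry when an MR is merged. The real GitOps API is internal to State Farm; the route, the "change record" fields, and the file-based store below are assumptions for illustration.

```python
# Hypothetical sketch of a GitLab webhook receiver that records a change
# entry whenever a merge request is merged into a config repo. This is not
# the actual GitOps API; the route and record fields are assumptions.
import json
from flask import Flask, request

app = Flask(__name__)
CHANGE_LOG = "change-records.jsonl"  # stand-in for the real asset management system

@app.route("/gitlab/webhook", methods=["POST"])
def handle_merge_request_event():
    event = request.get_json()
    if event.get("object_kind") != "merge_request":
        return "", 204
    attrs = event["object_attributes"]
    # Only MRs merged into the config repo's default branch count as a change.
    if attrs.get("action") == "merge" and attrs.get("target_branch") == "master":
        record = {
            "project": event["project"]["path_with_namespace"],
            "merge_request_iid": attrs["iid"],
            "merge_commit_sha": attrs.get("merge_commit_sha"),
            "merged_at": attrs.get("updated_at"),
            "title": attrs["title"],
        }
        with open(CHANGE_LOG, "a") as fh:
            fh.write(json.dumps(record) + "\n")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```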
So continuing with the Cloud Foundry side: obviously, as the GitOps platform team, a huge part of what we do is migration, all the time. One thing we're still working on is migrating from Jenkins to GitLab CI, and that is still a work in progress. Our Cloud Foundry users are mostly on Jenkins, and we have a dashboard that shows us where that status is so we can monitor the progress of that migration.

Another thing we set up: as I think May mentioned in the last session, our on-prem Kubernetes was already established. We wanted to update it so that when new namespaces are created, each one gets a Flux instance listening to that team's config repo. So we actually ended up enabling Flux multi-tenancy on our on-prem Kubernetes solution last year, and right now we're working on updating that from Flux v1 to Flux v2. That's the process we're in at the moment. (There's a rough sketch of what that per-namespace setup implies at the end of this section.)

All right. Another huge part of being a GitOps platform team is education. Earlier we said training, training, training, right? It's the best way to realize a large organizational change like adopting GitOps. Over time it has become an expectation that as soon as we enable GitOps on one of our strategic platforms, it's a given that we're going to do a roadshow. We obviously cover the core, foundational pieces that Pinky walked us through, but then focus on how you apply your changes on this particular platform versus that particular platform. As Pinky mentioned, we recently enabled Cloud Foundry, and we just did a roadshow for that.

Beyond the roadshows, and I touched on this earlier, we realize we're part of a larger community, so we've been intentional about sharing our progress outside of State Farm. It's through that that we got to connect with wonderful people, a lot of whom are here, to get that dialogue going and exchange ideas on how to solve these problems. In 2019, Russ and I presented "GitOps for the Win." That was fun. Last year it was Pinky and I, talking about GitOps and Terraform, a match made in heaven. And earlier this year, the recording is still out there, Russ and I went into detail on how we did Flux multi-tenancy.

All right. So basically, I'm going to walk through what our day-to-day looks like with support and all that. But before I do, I want to mention that we're actually a five-person team now; we upgraded from the three-person team we started with. It's nice, because there's May, who is obviously our manager and does a lot of technical stuff too. There's Nimmi, who knows more about Cloud Foundry; she's our expert on that. There's Russ, our engineer, who is our AWS expert. I focus more on Kubernetes. And our new member, Adam, does a lot of our data-driven work. It's nice to have that balance.

Okay, so our day-to-day. We were getting really swamped; it was hard to keep up with new tasks while also trying to answer support questions all the time. So we came up with a system we've recently been using: a weekly support rotation. One of us is on call and is the one monitoring that GitOps channel in Rocket Chat that I mentioned earlier. That's been working out really well, because we don't have to constantly context switch as much as we used to.
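To picture what "a Flux instance per namespace" implies operationally, here is a rough, hypothetical check that every tenant namespace actually has one. The label selectors and the idea of running this as a scheduled job are assumptions; this is not State Farm's actual multi-tenancy tooling.

```python
# Hypothetical sketch: verify that every tenant namespace has a Flux instance.
# Label names are assumptions; this is illustration only, not the real tooling.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
core = client.CoreV1Api()
apps = client.AppsV1Api()

# Assume tenant namespaces are marked with a label like "tenant=true".
for ns in core.list_namespace(label_selector="tenant=true").items:
    name = ns.metadata.name
    # Assume the per-namespace Flux deployment carries an "app=flux" label.
    flux = apps.list_namespaced_deployment(name, label_selector="app=flux").items
    if not flux:
        print(f"MISSING: namespace {name} has no Flux instance")
    else:
        print(f"ok: {name} -> {flux[0].metadata.name}")
```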
Another thing we do daily is check on that enforcer run I mentioned a second ago, at minimum just to make sure the secrets were rotated and that nothing funky happened. We're also big on community, within State Farm and outside of it. Within State Farm, we have frequent touch points with our platform admins just to make sure we're all on the same page and to talk about any new tech either of us has heard about, so we're all hashing it out. It's kind of fun. Outside of State Farm, we're pretty active in GitHub discussions about Flux, Terraform, GitLab, and a bunch of other things. We were also really fortunate to have Weaveworks do a little workshop with us, the Flux 2 migration workshop. That was really cool because we could ask them questions and give them feedback on their documentation, so it was a nice little partnership. And one thing we do every week is an hour-long office hours session. There's a sign-up sheet, and it's a nice place for people to come get more hands-on, one-on-one help if they need it.

Next is the fun part: outages. Right? No support story is complete without having to deal with outages. As Pinky and I were preparing for this, we reflected on the top two things we had to rally around, drop everything we were doing, and recover our customers.

The top one has to do with secrets. We have secrets in every automation we provide, and those secrets have elevated permissions, whether in your Git repository, your target cluster, or your target environment. Thankfully, the exposures that happened were internally initiated by our penetration test team. But the key takeaway is that over time we've gotten better at limiting the number of secrets that get passed around into the components of the solution stack behind GitOps. The less we require our customers to provide secret A and secret B, deploy keys and so on, just so the automation can do such and such, the better. We still follow the convention, we know where to grab the secrets from our vaulting solution, but we don't make customers pass them to us, which would raise the likelihood of a compromise.

The second one is funny because it happened while I was on vacation, and I came back to an "oh crap" type of deal. Flux gets deployed and the flux-git-deploy secrets were all still there, but if you looked into any one of them, the actual data, the actual secret, was gone. So again, following our standard operating procedure, we dropped everything we were doing, and we fixed it by rolling out a simple bash script that looped through all the namespaces that were impacted, exec'd into the running Flux pod, grabbed the identity we needed, and patched that secret again to recover our customers.
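The actual fix was that bash script; purely as an illustration of the same recovery loop, here is a rough Python equivalent using the Kubernetes client. The list of impacted namespaces, the pod label, and the mounted key path follow common Flux v1 conventions and are assumptions here, not a record of the real script.

```python
# Rough, hypothetical Python equivalent of the recovery loop described above
# (the real fix was a bash script). Assumes Flux v1 conventions: a per-namespace
# Flux pod with the SSH key mounted at /etc/fluxd/ssh/identity and a secret
# named flux-git-deploy that should hold that key.
import base64
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core = client.CoreV1Api()

IMPACTED_NAMESPACES = ["team-a", "team-b"]  # assumption: the affected tenants

for ns in IMPACTED_NAMESPACES:
    # Find the running Flux pod in this namespace (label is an assumption).
    pods = core.list_namespaced_pod(ns, label_selector="app=flux").items
    if not pods:
        print(f"skip {ns}: no Flux pod found")
        continue
    pod_name = pods[0].metadata.name

    # Exec into the pod and read the identity that is still mounted there.
    identity = stream(
        core.connect_get_namespaced_pod_exec,
        pod_name, ns,
        command=["cat", "/etc/fluxd/ssh/identity"],
        stdout=True, stderr=True, stdin=False, tty=False,
    )

    # Patch the emptied secret with the recovered key material.
    core.patch_namespaced_secret(
        "flux-git-deploy", ns,
        {"data": {"identity": base64.b64encode(identity.encode()).decode()}},
    )
    print(f"patched flux-git-deploy in {ns}")
```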
But that's just step one. The real benefit is: how do we mature, how do we learn from this? We started with the timeframe of when the situation occurred, so it was a lot of combing through logs; I say it was a lot of splunking. Out of that, in addition to understanding the logs, it was also working with our Kubernetes operators. And it all boiled down to the fact that a worker node had gone bad. The kubelet would sometimes report itself as, hey, I'm still part of this cluster, and other times it wouldn't. Okay, that's a contributing factor, but what rendered the kubelet in that bad state? After lots of working sessions, it came down to a piece of that node that was getting flaky, so this particular worker node was unable to report its status, and a taint-manager eviction then dropped the Flux pod.

The takeaway, though, is that now we understand why this happened. We do have Prometheus alerts that fired; at the time, the GitOps platform team was not included in them, so that was one change. Another key enhancement, and this is a maturity step for us, is to actually drain all those rich events and logs emanating from flux-system, because that's where a lot of the Flux resources live, and get alerted when things aren't behaving the way we expect them to.

Operations. GitOps has been generally available at State Farm since January of 2020, and over time we've gotten better at metrics and observability. It's an expectation, a given, that the pieces that make up the GitOps solution stack come with Prometheus and Grafana and alerts, and we take action on them. That covers the onboarding app and the UI that Pinky walked us through. We also have pieces running in AWS, and we leverage the out-of-the-box AWS services for monitoring those. Chat with us, we're here all week, happy to talk through the specifics, or hit us up on LinkedIn or Slack.

So that's one side: ensuring the availability of the components that make up the GitOps solution stack. But one other thing we set out to do this year is to bring forward the value our customers get from using GitOps, and that's what's behind a couple of dashboards we took screenshots of. One shows three metrics I'm really excited about. The first is deployment events: how many people are actually using GitOps, and on which platform, with AWS being the top consumer of GitOps at State Farm. Next is change lead time, and I welcome further dialogue on this, but the way we measure it right now is from the moment a code change is pushed to a source repo until that same commit SHA is applied and merged in the target config repo, which means it's realized in the production environment. Another metric that came out of change lead time is change size, which is super exciting to us, because GitOps is about small, miniature changes going straight to production. Because of that homegrown solution I talked about earlier, where we capture those milestones as a code change progresses all the way to prod, we're able to harvest that data and understand how many files were changed and how many lines of code were changed. That feeds the change size metric we now report to our customers.
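As a hypothetical sketch of how those two metrics might be derived from GitLab data (this is not the homegrown solution May describes; the project paths, the commit-to-MR matching, and using the commit's created_at as the "pushed" moment are all assumptions):

```python
# Hypothetical sketch: derive change lead time and change size for one change
# from GitLab via python-gitlab. Not the real State Farm pipeline.
from datetime import datetime
import gitlab

gl = gitlab.Gitlab("https://gitlab.example.com", private_token="REDACTED")

source = gl.projects.get("team/my-app")               # source code repo (made up)
config_repo = gl.projects.get("team/my-app-config")   # GitOps config repo (made up)

def parse(ts: str) -> datetime:
    # GitLab timestamps look like "2021-10-12T14:03:21.000Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def lead_time_and_size(source_sha: str, config_mr_iid: int):
    """Lead time: source commit created -> matching config-repo MR merged.
    Size: files and lines changed in that config merge request."""
    commit = source.commits.get(source_sha)
    mr = config_repo.mergerequests.get(config_mr_iid)

    lead_time = parse(mr.merged_at) - parse(commit.created_at)

    changes = mr.changes()["changes"]
    files_changed = len(changes)
    lines_changed = sum(
        1
        for change in changes
        for line in change["diff"].splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    return lead_time, files_changed, lines_changed

print(lead_time_and_size("abc123", 42))  # example inputs, purely illustrative
```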
Next is hardening. I already touched on this earlier, about lessening the number of secrets we pass around. But it also goes without saying that we embrace policy as code. For our Terraform-related resources we use Sentinel, and for our Kubernetes cluster we started using Kyverno to enforce tenant isolation. Last, and certainly not least, is governance, and a lot of this has been leading up to governance. We were already good at gathering the data and making meaning out of the rich GitOps deployment events. The recent enhancement we just added is that we can now, in real time, alert the necessary stakeholders when a merge request gets applied in the config repo without any prior approval, without any first-line leader approval. Because at the end of the day, we still have compliance requirements we have to abide by. The beauty of it is that it's an instant notification: hey, you're going beyond the bounds of what you're allowed to do with GitOps.

We're going to open it up for questions at this time. Are there any questions? Oh, there are questions. Let me jump to one over here. With Cloud Foundry recently enabled with GitOps at... Oh, yeah. So the question was: we mentioned that one of the metrics we capture is files changed; are there any other ones that we monitor, or measure, as well? That's a great question, and we can tag team on answering the ones we remember. We're heavy on measuring adoption first, with Cloud Foundry recently enabled. Another bit is: we're hybrid cloud, we're multi-cloud, so we have the ability to capture, here's product A, and where does it have a footprint across our many strategic platforms? I'm trying to think if there's anything else worthwhile. Even our cookbook: we measure how many folks get value out of the different sections, and it's funny how, whenever we do a roadshow, you see an uptick in a certain section of the cookbook. Again, it's all data-driven decisions. Even earlier, when there was a question about deployment events and massive changes: well, we have visibility into how often, what time of day, and what day of the week these deployments happen, so we can make an informed decision on when to do sweeping changes like that. Does that help? Okay.

What are some of the challenges that you faced in creating your multi-tenant cluster, and what did you do to be successful in that area? Oh, yeah. I guess I should... Okay. So I think I took the lead on that last year. It's a little hard, I think, when there's an established platform team. You have to find a working balance: you have to make sure you're not stepping on toes, but you also want to get them on board. We were lucky; they were very willing to hear us out. And do you want to describe who "they" is? Oh, "they" is the platform team, the on-prem Kubernetes platform team. They were already there, right? They've been there for years; they're our operators. And it was really nice because there's a lot of documentation out there on how to do it, so I don't think we found too much hardship there. It's also nice having a dedicated GitOps team, because this is our job, right? We were able to mob on it and make it happen and actually sit with it. There were countless meetings. That's what we're going through even now: every morning we have a recurring meeting on setting up Flux 2. So basically it was a lot of trial and error. We took that multi-tenancy repo that was already out there, the open source one, and we made it work for our internal environment. So enablement, buy-in, it wasn't too hard. I mean, you know. Does that answer your question?
Yeah, it took a couple of pitches to get it to take, but now we're happy to report that everyone loves it. Even our operators are saying, oh, we should GitOpsify this. Everyone's excited, right? It takes a while, but once they're there it's... Did that answer your question? Yeah, we can talk later. Yeah, for sure. We do have another question over here.

Hi, guys. So you represent a large commercial organization, bravely, you know, traveling down the GitOps path. I have a multi-part question. Since you operate in a large organization, do you have a semblance of a unified, enterprise-wide GitOps strategy, on both the methodology side and the tool chain and infrastructure side? That's the first part of the question. And second, do you incorporate any runtime policy or, put another way, image security scanning? Do you have any policy around that, and how do you incorporate it in the GitOps process? Thank you.

I don't know if either of us caught the first part really well, but we can answer the second part and then... The first part is: do you have an enterprise-wide approach across teams? How do you go about provisioning clusters in the right way from the configurations in GitOps? How do you go about pointing your pipelines, et cetera? Do you have a uniform approach across the organization?

That's actually a really good question. We don't have an enterprise-wide cluster approach. There are these clusters that are maintained by the operators we mentioned earlier, but it's more that a team requests a namespace on our on-prem Kubernetes. They already had a process for that, through a UI they call the namespace portal, and we just integrated with it. Through that process, there's a pipeline that stands up the multi-tenancy namespace repo that's monitored by the cluster-level Flux, and that's where the per-namespace Flux instance is stood up. You give it your config repo URL, and it'll be listening for that. Does that answer the first part of your question? Okay.

If I may add, too: we're not at the state we want to be in yet. We realize we have Kubernetes customers beyond on-prem. Yes, it's a large multi-tenant cluster, but we have Chris here, also from State Farm, who has his own little EKS cluster. So our desired state is to continue to use Terraform, even for how we stand up and standardize these clusters. We'll get there. And then the second part of the question, about scanning images; I think you can hit on that. Oh, wow. I think you can. Yes. We can chat some more, but before we were using... I'm drawing a blank. Oh, yeah: Aqua, and we recently switched to Snyk.

So we're going to invite our next speaker to start setting up, but we do have time for some additional questions while that's happening. The next one's from William. We'll move our stuff, though. Okay. How do you see or consider disaster recovery in that setup?

Okay. So I think we got this question online too, for May, earlier. I have mixed feelings on this. I know we've obviously created a heavy reliance on our GitLab instance, and we have seen it go down before. My opinion is that if GitLab is down, you can't run the tests and you can't run the pipelines, right?
So your code really shouldn't be going to prod either way. There was something you mentioned earlier too, I think, to answer that, besides the point I was making. Reconcilers, the concept of a reconciler, but that's really a break-glass scenario; it may or may not cater to DR. It's all in Git, so we can roll back to a commit SHA. I also understand that we're talking about lots of components, so good luck rolling back in the right order. That's where we're at: we're not at the state yet where we can accurately report, okay, this is the entire stack of components that makes up a product. But I will counter that with, like I said earlier, small, miniature changes going straight to production and less context switching. If what I change breaks, I know exactly how to fix it before I compound it with other problems. I also understand that maybe those problems don't surface right away, and that's the maturity point we're getting to: not just performance metrics and observability, but actual business metrics. And that's going to get into the progressive delivery topic that we really want to get going on. Awesome. Thank you so much. Give her a round of applause.