Hello, and welcome to our talk. We're representing Teachers Pay Teachers, and we're going to talk about how we implemented GitOps and what we learned along the way. We'll start with some introductions and context, then walk straight through the decisions we had to make, learning GitOps as we went. That'll take about 15 minutes. Then we'll wrap up, and finally we'll have some Q&A time.

Introductions. Hey, my name is Raptor, really Raptor. I've been doing DevOps-y sorts of things for about six years, mostly in education tech, and I'm just super excited to be here. Do you want to introduce yourself, Alicia? Sure. Hey, I'm Alicia. I'm really excited to be here, and I'm also very happy that KubeCon is in Detroit, because I'm from Southfield, Michigan, so I didn't have to travel, which was pretty awesome. I've also been doing DevTools kinds of things for about six years.

Some background on us. TPT, Teachers Pay Teachers, that's the acronym. We've got about five million resources available on just about every aspect of pre-K through 12 education. A lot of resources. Teachers generally know us, and they usually have feelings about us, generally positive, but if you have any teacher friends, definitely ask them. We've got about 250 employees, and we're mostly remote. And now we'll get into the more technical stuff.

Engineering context. We're about 100 devs deep, and we've got 101 services. We got really lucky when the two of us started at TPT about a year ago, in that almost all of our services have a lot in common, built from a similar Node.js template. We pretty quickly ruled out tackling the monolith, and we also had some front-end services we were able to work on. Our infrastructure stack is on Kubernetes; we have three clusters on EKS. We also have a bunch of serverless apps in there. We didn't really think we'd be able to do the same GitOps stuff with those, but it all worked out well. As for team topology, we're on DevTools, but a lot of the deeper cloud knowledge lives on a different team, so we worked closely with them on all of this.

So what led us to GitOps? You're here, so you probably know a thing or two about GitOps already. What brought us there is that we wanted to make software delivery improvements. Starting at a new company, you're not sure how you can help or what you can do, and we noticed three main things come up. The first was reliability. There were rollback bugs on our legacy system, which we called Houston. The legacy system would call into Jenkins, and Jenkins would do the actual work, and it just didn't always work, for rollbacks and deployments generally. Developers were not happy about it. The second thing developers really cared about was visibility. Because it was two systems, it was hard for a developer to know where a problem was day-to-day, and the first system didn't give you much feedback about whether things were working or not. And the third thing, which has come up a couple of times today already, was different deployment styles, like canary deployments; we really latched onto that idea for a couple of services. So, again: how could we improve software delivery? It took us a little while to figure out.
We made some complicated comparison charts, but we decided on Argo CD. We had some experience with Flux v1 and Flux v2, and it was a little rocky. Flux does a lot of cool stuff and taught us a lot of lessons, but we wanted to try something new, particularly because the upgrade from Flux v1 to Flux v2 wasn't smooth for us. We liked the Argo CD UI, and it's just bundled in, which was nice. We weren't experts on Kubernetes (I still don't feel like an expert on Kubernetes), so having that UI was helpful. And we just liked GitOps, enough that now we're here, I guess. Cool. Do you want to talk about some decisions, Alicia?

Thanks, Raptor. After we landed on Argo CD, we had a bunch of decisions to make. First, where do we make those GitOps commits? I think it was Jim from Harness who was talking about that earlier; I thought I saw him here, yeah. And second, what goes in those GitOps commits?

First, where we make the GitOps commits. Like Raptor said, we have about 100 services, each in its own application repo. We're using Helm, so each application repo has a Helm chart plus the values files for both staging and production. Once we spun up Argo CD, we created a separate repo to store the Argo CD application manifests. Even before adopting Argo CD, we'd been thinking it might be a good time to get those Helm charts and values files out of the application repos and centralize them, because it's a pain to manage individual charts and values files in each repo, and we were wondering how to do that. But at that point we just wanted to adopt Argo CD, so we kept it small and incremental and kept the existing structure. That's why we landed on keeping the charts and values in the app and service code repos.

Once we decided to put them there, we had to figure out what goes in the GitOps commits. For most of our deployments, it's basically just bumping a Docker image, so we figured our GitOps commits could simply do that: update the Docker image to the latest tag. And, like I mentioned, we have staging and prod. Pre-Argo CD, deployments worked like this: developers would deploy to staging, manually or in some cases automatically, then run an automated or manual test, check it out, and promote to production. We wanted to continue the same flow and make as few changes as possible. To that end, we decided we'd have two GitOps commits, and we moved to always-automated staging deployments, so no more manual staging: as soon as you merge your feature branch to the main branch, we make a GitOps commit to bump staging, pause for approval, and then promote to production.

After we implemented it that way, we realized we actually had a couple more things to put into those GitOps commits. We have internal tooling that makes sure that any time you change a Helm chart or values file, you bump the Helm chart version, too.
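To make that concrete, here's a minimal sketch of what one of those staging GitOps commits ends up touching. The file names and keys are hypothetical, not our exact layout:

```yaml
# chart/values-staging.yaml -- CI bumps the image tag to the new build
image:
  repository: registry.example.com/my-service
  tag: "3f9c2ab"   # the freshly built Docker image tag
---
# chart/Chart.yaml -- internal tooling requires a chart version bump
# whenever the chart or values change, so CI bumps this too
apiVersion: v2
name: my-service
version: 1.4.2
```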
Again, we could have just changed those internal tools to avoid it, but we figured it's not a bad idea to bump the Helm chart version every time anyway, so we added that to the GitOps commit. It's not a big deal; we're not making those commits by hand, the CI system is making them, and as you might have seen, we're using Codefresh. In addition to that, like I mentioned, we do staging, pause, then prod. Since we're making the GitOps commits to the main branch, once we'd made that staging commit, if we wanted to send a Slack notification to the author of the feature, we'd have had to look back a commit to find them. We could definitely have done that, but instead we just added the commit author to the GitOps commit itself. That's what we did.

And here you can see where we got to in our V2. V2 was great. Things were working, we had the good Slack notifications, and developers were mostly happy. But we started getting feedback that deployments were actually getting delayed, and there's an interesting reason for that. Prior to adopting Argo CD, like I mentioned, somebody would merge their feature branch to the main branch, do a staging deployment, pause, take a look, then go to prod. As soon as they merged their feature to the main branch, anybody else who wanted to deploy afterwards could already rebase their feature branch on the latest main. Once we adopted Argo CD and started making those commits directly to the main branch, developers had to wait until all the GitOps commits were done before they could rebase. And that wasn't trivial because, unfortunately (we're hoping to improve this), our CI times are a bit slow; probably a lot of people have that problem. So it was slowing down deployments.

Looking at the architecture again, we realized we wanted to split those GitOps commits out into another repo. Like I mentioned, we already had an Argo CD application manifest repo. So we thought, let's maybe reconsider moving our values files and charts out of the application repos and into that application manifest repo. But besides not wanting to make too many changes and boil the ocean, we were also relying on the presence of the charts and values files in the application repos for another internal tool, which spins up an ephemeral environment on every commit. That would have been a lot of work. So what else could we do? Was there any other way?

We stumbled across the fact that in your Argo CD application manifest, you can actually specify Helm parameter overrides. Okay, this is interesting: we can now make a GitOps commit, and I'll show you what it looks like, to our application manifest instead. It's kind of an interesting thing, and in fact the Argo CD docs call this out and say, essentially, don't use this in prod. So we were a little hesitant, because it's kind of a hacky way to do it. If you follow along here: we have our application repo with the values files and the charts in one repo, and the Argo CD application specs in another repo.
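Here's roughly what that looks like. A minimal sketch: the repo URLs, paths, and names are hypothetical, but `spec.source.helm.parameters` is the actual Argo CD Application field in question. (The `forceString` flag, which keeps an all-digit value from being parsed as a number, is relevant to a bug we'll get to in a minute.)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-production
  namespace: argocd
spec:
  project: default
  source:
    # The chart and values files still live in the application repo...
    repoURL: https://github.com/example-org/my-service.git
    path: chart
    targetRevision: main
    helm:
      valueFiles:
        - values-production.yaml
      # ...but the image tag is overridden here, in the manifest repo,
      # so GitOps commits no longer touch the app repo's main branch.
      parameters:
        - name: image.tag
          value: "3f9c2ab"
          forceString: true   # keep an all-digit commit SHA a string
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
```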
And in that other repo, we specify the override of the Docker image to use, so the Docker tag is set in that Argo CD application spec. We reached out on Slack, the CNCF Slack, or I guess the Argo Slack, and asked, and somebody said, yeah, that seems okay. And it's been working fine. It's great, application owners are very happy, and so far we're liking it. Raptor? Nope, hold on, sorry. One more thing.

Now that we have these changes to the Argo CD application manifests, how do you apply them? One thing people may be familiar with is the idea of app of apps: you can have an Argo CD application that manages other Argo CD applications. Even without all the other stuff I've discussed, we'd been thinking, hey, maybe we should do this app-of-apps thing, but we weren't sure. Here, though, was a good use case for it, because we needed to sync the Argo CD applications themselves, with the Helm parameter overrides, and those need to be updated before you do a deployment. If you don't update them before you deploy, you don't have the override in, and therefore you're deploying the old code. So you've got to make sure those are up to date. Once we make that GitOps commit to change the Argo CD application parameter overrides, we need to make sure those are synced. So we thought, this is a great use case for app of apps, let's try it out.

But we quickly realized it's actually a single point of failure, because if anything goes wrong with any of those apps' parameter overrides, which we actually experienced, everything stops. We had a Git commit SHA that happened to be only numbers, and Argo CD complained that it was an integer when it should have been a string for the Docker image tag. I might be butchering this, but basically the sync failed because of that, and once the sync failed, none of the other applications were getting updated. And if they don't get updated, all the deployments use the old tag. So we decided we'll just do a kubectl apply instead, and things are going well. And now, over to you, Raptor?

That was good. Yes, so I get to talk about syncing. I really like syncing, and I hope you'll join me in thinking about syncing. This is the Argo CD page you see when you click the sync button; there's a pop-up that says, okay, here's what you're about to do. I'm not always super great at reading everything, so I saw this page and thought, I don't know what to do with this. But I quickly realized I needed to know what these options did. In particular, I'll call out Auto-Create Namespace, which is one of my favorite flags, and I'd totally encourage everyone to use it. And then the bigger, scarier one here is prune. So let's talk about prune, maybe.

Initially, we decided to trigger Argo CD syncs manually. Our reasoning was to get more feedback to users and more control, especially as we were gaining comfort with the system. We had talked about using Argo CD Notifications instead to handle that, but we wanted to keep things consistent, and it was less work to stay consistent with how we wanted CI and everything else to move over to Argo products in the future, too.
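As a reference for those two options, here's a sketch of how they look if you set them declaratively on an Application spec. This is illustrative only; we trigger syncs from CI rather than using automated sync, so the manual equivalent is in the trailing comment:

```yaml
# Illustrative: the sync options discussed above, set declaratively.
spec:
  syncPolicy:
    syncOptions:
      - CreateNamespace=true   # the "Auto-Create Namespace" checkbox
    # With automated sync, you'd opt into pruning like this:
    # automated:
    #   prune: true
# We sync from CI instead; the manual equivalent with pruning is:
#   argocd app sync my-service-staging --prune
```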
We decided to keep prune turned off for the first couple of services we did, mostly out of fear. But eventually, we dove into the prunes. You can see this delicious bowl of prunes, and once you try a prune, you can't stop eating them, if you're me. So we turned it on. What pushed us there was CronJobs, which we needed to delete every time we deployed one of our apps, and those CronJobs had similar names that would conflict. When we had prune on, it worked; when we didn't, it didn't. We now have prune on all the time, and it's just been fine. So yeah, go for prune. We did decide to keep triggering the Argo CD syncs manually; we were pretty happy with the ability to get feedback, and to interpret feedback, from running Argo CD commands. I feel like it also helped us learn Argo CD better, specifying the different prune and sync flags every time and knowing what you want.

So how did we migrate? That's one of the fun things we can talk about as an end-user company. We started with one-at-a-time pairing. We got lucky: our cloud team said, here's a service, it's live in production, and if you break it, it won't be that big a deal. And we were pretty happy with one-at-a-time pairing for a while, in part because we were still learning GitOps ourselves and our company didn't really understand GitOps yet, and this was one way to build that education. One at a time, people just respond more, we learn more, and everyone's happier. After we got better at that, we formalized the process, which turned into a self-service "pipeline to add pipelines," which is just a fancy way of saying there was one process where teams could click a button and migrate their services over to GitOps. We watched how that went for a while and whether teams could actually self-serve.

Some of the reasoning behind these decisions: when we migrated our first couple of services over, we saw bugs we hadn't anticipated, including the malformed-strings one that Alicia mentioned. We found some staging database connections that didn't work, which was too bad, but we're glad we found them. There were a couple of Kubernetes settings out in the wild that weren't actually doing anything, and with Argo CD we figured out they were just nonsense settings. So that was nice; being closely involved meant we caught all those different cases. At about 30 services, we started thinking about how to automate the rest, because remember, we had about 100 services to tackle. We haven't gotten all the way through them, but it's going pretty well; the automation is fine.

Did people like it? Yeah, the early adopters super loved it. Again, our legacy system was just okay; we needed to improve it, they saw that we improved it, and they were happy. There are a couple of services less excited about moving over. Change is hard, especially changing how you deliver software, and as always, things go wrong sometimes, right? So it's understandable. We got it down to less than 30 minutes for a service to run CI and deploy to staging and production. We didn't hit any major GitOps-related issues rolling out new services, which was great; we kind of thought there would be more, but it was fine. One thing we even expected to have issues with is how Argo CD works: it runs a Helm template to render the manifests into plain files, then applies them, whereas before we were doing a Helm upgrade to manage our releases.
So switching from Helm upgrade to Helm template plus apply was fine. If you're using Helm upgrade right now and are worried about that, it'll be okay. And there was the roadblock mentioned before: GitOps commits in the same repos, plus long CI times, plus frustrating rebases. But we solved that the way Alicia described. So that was kind of a lot, maybe; I felt like it was a lot. Does anyone have any questions for us right now? Otherwise, we can talk about other stuff. I see one over here. Do you want to start?

Hello. Okay. It's really interesting, because we went through this exact thing a year ago, and if you wouldn't mind, I'd love to talk to you afterward, because I think there's a next step you can take that's quite impressive. It's exactly the same path, the same repo first and then a couple of repos, but the way forward, I would say, is with ApplicationSets and how you can divide different things and the values. So yeah, I'd love to chat afterward. That's it; I won't bother everyone with it. Thank you.

Hello. My question is, can you talk a little more about where the bottlenecks were, if there were any, regarding the scalability you mentioned? It took 30 minutes, granted there were multiple environments. What did you notice? Where were the bottlenecks in deploying a lot more services?

Let me try to repeat the question. You're asking, when we were migrating a bunch of the services, where were the bottlenecks? And did you mean something specific? More on the deployments, not the migrations, just deploying your code to an environment: where did you notice bottlenecks, if at all? Do you mean bottlenecks in terms of time? Right, yeah. Like, how many deployments could you push to production? You mentioned 30 minutes, right? So if you had 30 services, what happens if you do 100 services, or 500? Where would you see the bottlenecks?

Got it. Yeah, thanks for the question. To repeat it: we have about 100 services, that's a lot of services, it takes 30 minutes to deploy, so where are the bottlenecks? To be fair, we did not do 100 services at once, so we never really experienced a huge rollout. Like Raptor talked about, at first we did one at a time, and even now that we're starting to automate this, we're going to put a gap between them. Hopefully we never have to do it all at once; if we stand up a new cluster or something, we might run into it, but just to avoid the whole issue, we don't need to do the migration all at once, so we did it iteratively and broke the migrations apart. To address something else you might be asking: we haven't had any scaling issues with deployments yet. Application developers make changes at any time of day, and maybe five applications will be deploying at the same time, but, Raptor can correct me if I'm wrong, we haven't seen any scaling issues yet. So that's encouraging, I guess.

I'm trying to understand the developers' end-to-end workflow, if they're your users, basically. They make a PR with their application code, right? And they commit that to the main branch or whatever. And then you have this Argo CD repo... I'm going to stop you right there for one second. Yeah.
They make a PR and then they merge the feature branch to the main branch, okay, same thing. And then they do a new PR on the Argo CD repo, right? No. No. Yeah, sorry, let me repeat; I guess you're on the mic. We've got a lot of good slides to get through. I was just trying to imagine where the human gates are for them, is what I was basically getting at. Yeah, sure. Should I take this one, or do you want to? Oh yeah, go for it.

Okay. So they obviously have to get PR approval, and they have to pass CI; GitHub enforces both of those things. Then, once they merge to the main branch, they themselves can approve the promotion to production. So everything's automated: through Slack, we have an approve button, and pressing it sends an approval that kicks off the next GitOps commit to change the prod tag and trigger a sync. One thing Raptor touched on is that we're not auto-syncing right now. Right, so who's doing the manual sync? The CI system is triggering it. Okay, so it's not you; for some reason I thought it was you, and I was like, how does that happen in 30 minutes? Do you have someone sitting there all day long? No, we use Codefresh, and Codefresh is triggering it. Thank you. Sure. Yeah, I guess we called it manual, but we're not doing it by hand. Thanks for the clarification. Good question. How are we doing on time?

So, you mentioned you found early adopters, let's say the first guinea pigs for the exercise. Was there a process? Did you already know, like, oh, we can always go to those five app teams, because they'll try things before others do? Please, go for it. You're killing it. All right, I'll keep rolling. So the question was, how did we identify early adopters; basically, who volunteered? Like Raptor mentioned, we work closely with cloud ops: we're on the DevTools team, but cloud ops volunteered a service. One of the fun things we did is, like Raptor mentioned, we have a legacy custom deployment tool called Houston. We put that thing on Argo CD, so we were deploying the deployer, the old deployer, with Argo CD. So we started with our own apps, basically, and then put out a call to action: anybody want to be an early adopter? Generally, people were feeling the pain of deployments; like Raptor called out, there was low visibility and rollbacks were painful. So we said, we're trying to solve this, we think this will help, who wants to volunteer to take this journey with us? Luckily there were a few volunteers, so we started with those. It was definitely a manual pairing process with those early adopters, but then word of mouth kicked in: hey, this is actually cool. That's how it worked. Yeah, I think we definitely had to beg a little bit, too. We have a TL, a team-lead concept, and some TLs were like, yeah, let's adopt this, and some were like, maybe later, let's see what the other TLs do. So it was also just Slacking people from our chart of who owns which services: will you please let us use your service and make things better? If there are no more questions, we have other slides we can talk about. How are we doing time-wise? All right, cool, more stuff, if we have time. Cool.
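To tie together the end-to-end workflow from that earlier question, here's a heavily simplified, hypothetical sketch in Codefresh-style pipeline YAML. The step names, images, and the bump-tag.sh script are made up for illustration, not our actual pipeline; it also assumes the pipeline has Argo CD credentials available.

```yaml
version: "1.0"
steps:
  bump_staging:
    title: GitOps commit - staging
    image: alpine/git
    commands:
      # hypothetical helper: rewrites the image tag override and pushes
      - ./scripts/bump-tag.sh staging "${{CF_SHORT_REVISION}}"
  sync_staging:
    title: Sync staging
    image: quay.io/argoproj/argocd
    commands:
      - argocd app sync my-service-staging --prune
  approve_prod:
    title: Promote to production?
    type: pending-approval   # pauses until approved (we surface this in Slack)
  bump_prod:
    title: GitOps commit - production
    image: alpine/git
    commands:
      - ./scripts/bump-tag.sh production "${{CF_SHORT_REVISION}}"
  sync_prod:
    title: Sync production
    image: quay.io/argoproj/argocd
    commands:
      - argocd app sync my-service-production --prune
```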
I think there was a lightning talk I missed about this, and I'm kind of bummed about it, because one of the things that held us up when we first started was: what should the Argo CD app repo structure look like? We found a couple of blog posts with ideas, but we weren't really committed to following any specific one. How we ended up doing it: we have three EKS clusters, and we just made a folder called apps and a folder called rbac for each of those clusters. I don't know if I worded that well, but you can perhaps see in this image there's eks-ci, then apps, and then some of our apps inside. The RBAC stuff didn't really fit into the Argo CD structures we saw initially, so we just added that later. But the point is, it's not worth getting held up on what the right repo structure should be. Ideally, we had used Helmfile a bit in the past and kind of liked that setup too, but with just three clusters it seemed better to keep moving. We'd love to talk more about this if other people have better structures.

Another thing that held us up initially was installing Argo CD and Argo Rollouts across clusters. We had seen Argo CD Autopilot, which I think wasn't fully released yet when we were installing things, but we really like it now, and in hindsight we probably would have used something like that. What we did use was the Argo CD Helm chart, plus the argocd-apps Helm chart to manage additional apps. And we do manage Argo CD through Argo CD, so that's been cool. ChatOps stuff: I think you've kind of been owning the ChatOps things. Sure. Yeah, we've been moving closer and closer to being able to do everything from Slack, like the approvals through Slack we talked about, with tons of metadata in there, so it's been pretty sweet.

So let's wrap it up. Where we're at now: a lot of our delivery is working really nicely with Argo CD and GitOps. We'd like to start incorporating Argo Workflows; CI was the bigger beast for us to tackle, so we started with CD and delivery, improving that first. Things are pretty smooth. We didn't mention it, but the diff checking you get moving over to Argo CD is really nice, and we're just happy using it. We still have lots to iterate on and learn, obviously, as you've probably seen from some of the slides and our confusion, but overall we feel good. We love GitOps, the company has been supportive, and I think people have appreciated it and seen value in it. We're excited for the future. Thanks, everyone. I just want to call out Christian over here; this is our first time giving a talk, so thanks for the push to try it. And, like you mentioned, we'd really like to continue the discussion if anybody has ideas. We know we don't know a lot of things. So it's great to be here. Thank you very much.