Hi, everyone. Welcome to our session, GitOps Everything. My name is Ayelet Delos. I'm a platform software engineer at AppsFlyer. Today I'm going to talk about a GitOps solution that was built in a special team in our platform group by Eliran Bivas. If you attended a talk earlier today, you may have seen some of this introduction, but be patient, because I'm going to dive into the actual architecture and the solution itself.

If you're not familiar with AppsFlyer, we are the market leader in mobile attribution. You can see the numbers: we were founded only 10 years ago, we already have more than 3,000 customers, we have 20 offices globally, and we're continuing to grow all the time. Looking at AppsFlyer engineering, we have more than 400 engineers working in squads. We run more than 8,500 microservices handling more than 2.5 million events per second, and we integrate with thousands of cloud resources and SaaS integrations. We've grown exponentially, and to keep growing we need to improve ourselves and learn how to handle that growth.

So this is what we're going to talk about: the developer pain points we found, and why we need to solve them in order to continue to grow. As you know, this is a GitOps conference, so: why did we choose GitOps as our solution? How did we approach the solution? And what is our special AppsFlyer solution?

We started by looking at our developers' day-to-day experience to understand their major pain points. First, daily operations. We found they are very, very fragmented. For example, a developer who wants to push a single commit for a single feature needs to go through Git, through a CI system, through a test system, through a different service for deploying the service, and through yet another system for monitoring the service in production. Very, very fragmented.

Second, developer autonomy. As I said, our teams work in squads, meaning they should work with full autonomy. We saw that this isn't really the case. Developers often need help and assistance from the platform team: they don't know how to build basic infrastructure, or they're missing permissions. I can say as a platform engineer that we get many tickets from developers who don't know how to do something or are missing permissions, and then they have to wait for us to grant access or assist.

Third, setups. A setup can be something like bringing up an RDS cluster. We saw it's not easy, and it's not clear what happened or how to bring it up.

Lastly, transparency of the development process. It's very unclear. It's hard to know what happened to my service, who deleted my cluster, who made this commit. Very hard to monitor and understand what happened.

After we understood the pain points, we decided we must increase developers' ownership, drive productivity, and remove the limitations and bottlenecks from the development experience. So, why did we choose GitOps as the way to approach these pain points?
Most of you know, or by the end of this day probably all of you will know, the four principles of GitOps: the desired state should be declarative, versioned and immutable, pulled automatically, and continuously reconciled. We should always converge to the desired state the user wants for their service. We saw that these goals suit our needs, and we decided to base our solution on the GitOps principles.

We then defined our own solution principles. It should be intuitive: very easy for the user to adopt and understand how to use. Everything should happen in a single source of truth, in the same place, not fragmented all over. It should be self-serve: we want full autonomy for the user, who should be able to handle a service by himself. It should be declarative. It should be auditable: very clear what happened and when. And it should be community-driven: we don't want to reinvent the wheel; we want to take best practices from the community and adopt them in our solution.

So, how did we approach this? We looked at the developers' day-to-day flow as we see it at AppsFlyer. We have developer A and developer B. Developer A commits some code changes, declares a service deployment plan, and commits it to Git (you see GitLab here because that's what we use at AppsFlyer). Developer B commits some infrastructure requirements and some PagerDuty policies. They both push to the same Git repository. Then, without getting into details yet, the GitOps workflow comes into play: it takes all the changes that were just committed to the Git repositories, applies them, and communicates them to all of our resources, from Kubernetes deployments to SaaS integrations. We want to cover everything; everything should be automated by the GitOps workflow: Kubernetes deployments, cloud services, and SaaS integrations like Datadog monitors. Just by that, without diving into the actual solution, we already get from Git an intuitive solution, a single source of truth, and auditability.

After that simple flow, let's dive into our actual AppsFlyer solution. We started with our GitLab repositories and decided to create a metadata folder. Maybe some of you are familiar with the idea, because the CNCF community has already started using metadata folders, but we created our own unique one to suit our developers' and AppsFlyer's needs. As you know, developers don't like changing their repositories and services, so we wanted something very clear and very easy to adopt. As you see here, we have a .af folder, and under it three subfolders: actions, deployments, and environments. In the actions folder, each subfolder defines a different action for the service, for example a build folder that defines how to build the service. Then we have the deployments folder. Each service can have multiple deployments, because a service can have SaaS deployments, like Datadog monitors or PagerDuty policies, as well as other services it needs in order to be deployed; every deployment gets its own subfolder that defines how it should be deployed. Lastly, the environments folder defines general environment settings for the service, for example which region and which availability zone to deploy to. As you can see, all the files are Terraform files, a declarative language.
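To make the layout concrete, here is a minimal sketch of what such a metadata folder could look like. The talk doesn't show the actual file contents, so the file names, paths, and variable names here are hypothetical illustrations, not AppsFlyer's real conventions:

```hcl
# Hypothetical layout of the .af metadata folder (names are illustrative):
#
# .af/
#   actions/
#     build/main.tf         # how to build the service
#   deployments/
#     service/main.tf       # how to deploy the service itself
#     monitoring/main.tf    # SaaS deployments, e.g. Datadog monitors
#   environments/
#     production.tf         # general settings: region, availability zone, ...

# .af/environments/production.tf -- assumed setting names, for illustration only
locals {
  region             = "eu-west-1"
  availability_zones = ["eu-west-1a", "eu-west-1b"]
}
```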
If we look inside one of these files, you can see that with Terraform we can also provide platform modules for the developers. Now, to bring up a Kafka cluster, the user doesn't need to know anything; all he needs to say is "I want Kafka with this version", and we do the rest for him (you'll see a sketch of this below). Developers who are familiar with Terraform can also write their own resources, like an RDS cluster, and use Terraform providers for SaaS integrations, like creating a new Datadog monitor.

We gave great power to the user: he can now define anything he wants for his service, any deployment. And as you know, with great power comes great responsibility. We now need to control which permissions a service gets, who can get inside my service, and what can go out. That brought us to Open Policy Agent (OPA). We chose it because it has a very big community, specifically in the CNCF; it's policy as code, very easy to test; and there are a lot of policies and best practices we can take and leverage from the community, and we can also contribute our own. We, for example, contributed a Terraform policy describing how a Terraform deployment should look.

Now our solution looks like this: in the Git repository we have the .af metadata folder with the desired state, written in Terraform. The GitOps workflow takes the desired state, all the changes, and applies them to all of our resources, as long as they pass the OPA policy.

Let's stop for a minute and look at our toolbox. We have Git, Kubernetes for deployment, Terraform for the desired state, and OPA for policy. Just by that, we've covered all the principles: it's an intuitive solution; there's a single source of truth; it's auditable, coming from Git; Terraform is a very self-serve, declarative language; and our whole toolbox is community-driven. We didn't invent any wheel; we're using existing tools from the CNCF.

Okay, let's go back for a minute to the GitOps principles. The state should be continuously reconciled: we should always converge to the desired state the user defined for his service. We investigated and saw that the continuous-reconciliation solutions in the GitOps community are mostly application-based. As I mentioned before, at AppsFlyer we have a huge infrastructure: more than 8,500 microservices handling more than 2.5 million events per second, and thousands of resources we integrate with. The community solutions don't really suit our needs here, because we want reconciliation for the infrastructure as well, not only for the user's application.

You may say: you're using Terraform for the desired state, and running terraform plan can detect drift in a service, right? When running terraform plan, we can see what is missing, what was already deployed, what will be added, and so on. So why not use that? As I already mentioned a few times, we have more than 8,500 microservices. We want to detect drift all the time, but we can't run terraform plan every few minutes across 8,500 microservices and thousands of resources; it's very, very expensive, time-wise. And we're building this solution to improve the developer experience, which should be fast and clear, so this didn't suit our needs.
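As a rough sketch of the two styles of deployment definition described above: a platform-provided module where the developer only supplies a Kafka version, and a developer-written SaaS resource. The module source and its input name are assumptions standing in for AppsFlyer's internal platform modules; `datadog_monitor` and its arguments are from the real Datadog Terraform provider:

```hcl
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

# A platform-provided module: the developer only states the Kafka version.
# The source URL and variable name are hypothetical; the real module would
# hide networking, sizing, and security details.
module "kafka" {
  source        = "git::https://gitlab.example.com/platform/modules/kafka.git"
  kafka_version = "2.8.0"
}

# A developer-written SaaS integration, using the official Datadog provider.
resource "datadog_monitor" "high_error_rate" {
  name    = "my-service: high error rate"
  type    = "metric alert"
  message = "Error rate is too high on my-service."
  query   = "avg(last_5m):avg:my_service.errors{env:production} > 100"
}
```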
So we started searching for a good solution for our case, for the big infrastructure we have. After investigating, including looking at new startups, we found Firefly, a startup from Israel. Firefly gives us the ability to get alerts about our code and also from outside our code, without us taking any action; we don't need to run any command like terraform plan. With the Firefly integration we can detect code changes inside our desired state. We can detect drift from the desired state: for example, if a developer manually changed or deleted an EC2 instance in AWS, we get an alert that it happened. We can see unmanaged resources: for example, an EC2 instance that exists in AWS but not in our desired state. And we can see ghost resources: resources we think we manage but that don't actually exist. So now our solution looks like this: everything we had before, plus drift detection alerts both inside our code, inside the GitOps workflow, and outside, looking at our cloud resources.

Okay, so we have detection from the code and from outside the code, but how do we manage our repositories? How do we know which repositories to manage? I'm sure most of you are familiar with Flux, basically a set of Kubernetes controllers that keep your cluster in sync. We use Flux as infrastructure, not as the whole solution, and we use some of its controllers: the kustomize controller and the source controller. The source controller reconciles our Git repositories: it watches all of our services, all of our repositories in GitLab, and when it detects changes it clones the repository artifact and saves it in the GitRepository CRD. The kustomize controller watches our fleet repository, which is basically a set of YAML files, each one describing a different repository we need to handle; it detects changes there and stays in sync with the source controller. By that we also have detection over all the repositories we need to manage in GitLab.

Okay, but how does it all connect? How do we actually do the reconciliation? We have detections from the cloud and inside our code, and we know which repositories to manage, but how do we reconcile thousands of resources and 8,500 microservices? For that we created our own special solution, called the deployment watcher. In the deployment watcher we have two Kubernetes controllers: the Git repository watcher controller and the deployment unit controller. The Git repository watcher controller reconciles between the GitRepository CRD and the deployment unit CRD. The GitRepository CRD, as I mentioned before, gets updates from the Flux controllers about repositories that changed or are new. Once the Git repository watcher controller detects a change, it takes all the deployments from the service (as I mentioned, each service can have multiple deployments), and each deployment gets a section in the deployment unit CRD. Then the deployment unit controller reconciles the deployment unit CRD and detects new deployments that need to be deployed, either triggered by alerts from Firefly, as I mentioned before, or on a 10-minute interval in which it goes and looks at the deployment unit CRD.

Once it detects a deployment that needs to be deployed, it sends it to the deploy tools. Why doesn't the deployment unit controller do the deployment by itself? We saw that infrastructure deployments, for example bringing a Kafka cluster up, can take a lot of time, a few minutes, even ten, while a controller's reconcile loop in Kubernetes needs to be fast, on the order of milliseconds. This is why each deployment gets its own Kubernetes Job, in which the deploy tools know how to run terraform plan and terraform apply; there's a sketch of this idea below. Later, when the deployment succeeds, we get an alert that it finished successfully. So the deploy tools handle the deployments, not the controller; the controller only listens and reconciles all the deployments.

This is our full reconciliation solution. Just to give numbers: with this solution we benchmarked 1,000 reconciliation deployments in under four minutes. But don't think everything went smoothly. We have a big infrastructure at AppsFlyer, and we need to learn how to handle these amazing numbers, this amazing success. I can tell you, for example, that my team is in charge of GitLab at AppsFlyer, and a few weeks ago the whole GitLab crashed and we didn't understand why. We investigated and found that this amazing solution had sent GitLab 4,000 jobs at the same time. So we still have work to do in practice to handle these numbers, but we're getting there.

And this is our solution: Terraform for the desired state; drift detection inside the code and outside the code from Firefly; Flux for listening to our repositories in Git; and our own special reconciliation solution at AppsFlyer connecting everything together.

Okay. If you remember, I spoke about the transparency we want developers to have into what happens with their code. With this solution, everything happens in the GitOps workflow; the developers only describe and define their desired state, so they don't directly see what happened with their services. How do they know what happened? For that, as you can see up there, we created a GitOps user in GitLab. Each time a deployment or a change happens in our GitOps workflow, it posts a commit status back to the GitLab repository. You can see Terraform pre-plan, plan, apply, and verify. You're not familiar with terraform verify because it's not really a Terraform command; we created it for this scenario. If we open, for example, the terraform apply status, we can see exactly what happened: how many resources were added, changed, or destroyed. If we open the terraform verify status that we created, we can see exactly what happened in the deployment: in which pod it was deployed, at what time, and whether it was successful. You can also see a link to the Datadog monitors, where we can see exactly what happened in the deployment; the last log line shows that the job finished and the status was synced. And with that, we solved all of our pain points. Daily operations are no longer fragmented; everything happens in the GitOps workflow.
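A minimal sketch of the "one Kubernetes Job per deployment" idea, expressed here with the Terraform Kubernetes provider. The image, commands, and names are assumptions; AppsFlyer's actual deploy tools are internal and were not shown in the talk:

```hcl
terraform {
  required_providers {
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
  }
}

# Hypothetical Job wrapping a single deployment: the deploy tools run
# terraform plan / apply here, outside the controller's reconcile loop,
# so slow infrastructure deployments don't block reconciliation.
resource "kubernetes_job" "deploy_unit" {
  metadata {
    name = "deploy-my-service" # assumed naming scheme
  }
  spec {
    backoff_limit = 0
    template {
      metadata {}
      spec {
        restart_policy = "Never"
        container {
          name    = "deploy-tools"
          image   = "hashicorp/terraform:1.5" # any image bundling terraform
          command = ["sh", "-c", "terraform init && terraform plan -out=tfplan && terraform apply tfplan"]
        }
      }
    }
  }
}
```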
Developer autonomy is much higher, because developers define what will happen in their service through the desired state, which is very self-serve in Terraform. Setups are very easy to bring up, because we do everything for them in the GitOps workflow. And transparency is very high, both through the auditability that comes from Git, which makes it very clear what happened in a service, and through what I just showed you, where we commit back to the user exactly what happened in their service. So everything was solved. And that's it. Thank you all for joining my session, and if you have any questions, I'm here. Do you want to take questions now? You ready? Anybody got questions?

Q: Hi. One question is, have you considered using Crossplane, if you are familiar with that technology? And a second question: how do you manage workspaces with Terraform?

A: How do we manage workspaces in Terraform? Yeah. Regarding the first question: in the previous talk, Eliran Bivas mentioned that we went shopping in the CNCF community. We investigated and also spoke with a lot of vendors to see what would suit our needs best, and this is how we arrived at this solution. We saw what would work best with our huge infrastructure and be the most automated, and this is why we chose these tools. Again, what was the second question?

Q: How do you manage workspaces, or do you just use the default workspace in Terraform?

A: Workspaces. We just use the default for now, yeah.

Q: Okay, thank you.

A: You're welcome. Any other questions?

Q: You said that you have a lot of resources to manage, and I was wondering if you also manage physical nodes, bare-metal nodes, or is everything in the cloud?

A: Not everything is in the cloud, and we also manage all kinds of SaaS integrations, not only cloud assets. We manage Vault, we manage Datadog monitors, and we also have a lot of PagerDuty policies.

Q: So the installation of a node is managed in a GitOps way?

A: Yeah, everything that can be defined in Terraform, we can manage.

Q: Okay, thank you. Awesome. I saw that you include Gatekeeper in your GitOps workflow, and I was wondering where you check those policies; I mean, where do you enforce them? Once the changes are committed in Git, or while, for example, a pull request is open? Which part of the workflow?

A: Before we deploy the changes. The Open Policy Agent knows how to work with Kubernetes clusters; we have the policies defined there, and if a deployment doesn't meet the policy, it won't get deployed and we get a warning.

Q: Yeah, so it's before the deployment, if I understand correctly. What happens before merging the pull request to the master branch?

A: Yeah, the GitOps workflow runs after that. The developer defines the desired state, but only in the deployment process, before the deployment happens, does OPA come into play and check whether the deployments comply with the policy.

Q: Okay, thank you. And another question, actually, concerning the way you deploy resources. If I understand correctly, you have a wrapper around the configuration of whatever resource you want to deploy, which you handle with Flux, and then you have Terraform doing the actual deployment of the Kubernetes resources. And you do this to decouple the deployment from Flux, so that Flux stays happy, and then you use Terraform to deploy the actual Kubernetes resources?
A: Right. We only use Flux to manage our repositories, to detect changes in them and to know which repositories to manage. The actual deployment we do with our own special solution, which in the end uses Terraform for the deploying.

Q: And you do that for custom resources as well?

A: Yeah, we use it for everything.

Q: Thank you. Sounds like a really robust solution.

A: We have a huge infrastructure. Any other questions? All right, oh, one more, last one, everybody.

Q: How do you handle the Terraform state file for every Terraform service? Whenever you run Terraform, you need to know the state from the previous run, right?

A: How do we handle the Terraform state file? Right. As I mentioned, we don't use the terraform plan command for detection all the time. But Firefly, which we integrate with, can also detect changes there, inside the code; it monitors it and sends us alerts if there's a change or a drift there.

Q: Cool, all right, thank you very much.

A: Thank you. Thank you.