Thank you all for coming, and welcome to this session. Today I'm going to talk about the changes and adaptations we had to make to adopt GitOps across our entire engineering organization.

A little bit about myself. My name is Eliran Bivas. I'm a self-proclaimed tech junkie: I enjoy experimenting with any technology out there and trying out new features and frameworks. I've been at AppsFlyer for the past two and a half years now as a cloud-native horizontal leader in the platform group.

A little bit about AppsFlyer. I'll let the numbers sink in for a minute, because as you can see, there are a lot of them. AppsFlyer is the market leader for mobile attribution. We operate globally across 20 offices and employ more than 1,300 people. As you can see from the numbers, the market is quite large and highly rewarding.

A little bit about AppsFlyer engineering as an organization. The organization grew exponentially and now has more than 400 engineers. We work in a squad-like fashion and operate more than 850 microservices that handle 2.5 million events per second. We also operate more than 250,000 cloud resources, ranging from EC2 instances up to EKS clusters, and we have dozens of SaaS integrations like Datadog and more.

This is Reshef Mann, AppsFlyer's CTO. About a year ago, he stood in front of the entire engineering organization and told us: look, you guys are rock stars, you're amazing, but something isn't working as expected. It has become very, very difficult to push a new feature into the market, and we must change our method of operation. That new method will be GitOps.

So what happened a few months before that declaration by Reshef? Well, we sat down as a platform group and tried to figure out what steps we needed to take to bring our infrastructure to its next level. But where do we actually begin? Do we even have issues in our platform or our infrastructure? Are we helping our developers achieve enough productivity in their daily activities? Simply put, we had a lot of unknowns to begin with. We didn't really know the exact scale of our architecture; I just gave you the number of 250,000 resources, but back then we really didn't know that. We didn't even know what our developers were actually doing or what they were trying to achieve, so we couldn't tell them how to work faster or identify what was slowing them down. And how do you even start to explore an infrastructure as huge as AppsFlyer's?

As an engineering organization, our mission is to support the company toward its success. As a platform group, our customers are the developers, and we need to understand their daily needs so we can provide them with value that benefits their business. This is the end goal of every engineering organization. But how can we effectively collect feedback from the whole engineering organization? You can't simply go and ask each and every person: what are you doing today? Am I doing it right? Am I doing it wrong? This is the scale of the platform group compared to the entire engineering organization, so we can't really go and ask everyone; it wouldn't scale. We also only knew about problems when people reached out to us: look, I've got a broken cluster, or something like that. And when we looked at ourselves as a platform group, we had issues as well.
When we started the process, we were split into seven vertical teams handling different domains of our platform. We were working mostly in survival mode, trying to keep our heads above water and putting out one fire after another. Maybe a Kafka cluster would break down, a Spark job wouldn't complete, or a deployment wouldn't reach a certain state. Because of that, it was very hard for us to actually evolve our system. We couldn't really move to the next Spark version or update our Kafka clusters; we were just trying to survive our infrastructure. And it was very hard to align. We were operating in squads, but all of them were verticals and didn't provide a unified experience for our users. Each one had a different mechanism of operating, so it was very difficult to collect feedback, because any given piece of feedback would only fit a certain team.

If you look at the diagram by Henrik Kniberg that describes modes of operation, we were working with high autonomy but extremely low alignment. Our entire culture was chaotic; it was very hard for us to align. We were striving for high-autonomy, high-alignment operation, which enables a more collaborative culture, and we wanted to work together toward a common goal.

Our road ahead was unknown, as I mentioned earlier, but we had to start collecting information, because we had to understand where we stood. We first wanted to focus on the developer experience, and what we needed to learn was: what are the daily activities we're trying to boost? What are the developers actually doing, how can we assist them, and what prevents them from moving faster?

We introduced a new model that we called the Champions. The Champions were software developers selected from the teams themselves, and in bi-weekly meetings with the platform group we were able to learn their requirements and needs when working with the platform. It took us a few months to kick off this process, but we learned a lot from it. We moved away from a model where we usually only got information when something went wrong, so you could never really know what people were trying to achieve, only that you were working with them to solve their problems; that wasn't an effective way to collect feedback. We moved to a model where the Champion acts as a representative for the team and can explain: look, this is what we're trying to achieve, these are our goals, and these are the pain points I have. We also used the model in the other direction, to test and validate some of our assumptions: will this tool be effective for you? Will this change work for you?

Through that, we learned that the entire daily operation of our users was highly fragmented. They had too many systems to operate. To push a single commit into production, for example, it took you something like seven different systems, and it was very, very tedious work: you push a commit to Git, build it in Jenkins, test it in a different system, deploy it in yet another one, and operate it in a different system again. The entire experience was non-unified. Everything was fragmented, and you had to learn how to operate each and every system.
We looked at the autonomy of our developers and found that even though we said we were operating in squads, we didn't really provide them with any capability to succeed on their own. If they had any infrastructure requirement, they had to come to the platform group. If the infrastructure was broken, they had to come to the platform group. For anything infrastructure-related, we were their bottleneck; we didn't really help them that much. And when you're giving an estimation for a project, you can't really give a valid estimation if you're relying on someone else, in this case the platform group.

We looked at our setups and found that they were mostly fragile. If something bad did happen, we didn't know how to bring the system back up efficiently; it was very, very hard to restore any failed system. If, for example, an RDS cluster was deleted, we didn't know how to bring it back up, so there was a lot of magic involved. And last but not least, we had a lot of issues with transparency and the complete auditability of our system. If something went wrong, we didn't know why, and it became a very tough job to figure out what to do. Take the RDS cluster example: if it was deleted, who deleted it? When? Why?

We looked at ourselves as a platform group and considered our next evolution. In our first incarnation we called ourselves platform 1.0. Then we were split, some might say too early, into seven vertical teams; that was platform 2.0. Now came our next evolution, 3.0. We selected two developers from each of the squads inside the platform group and formed what we called the cloud-native topic. A topic is a mixture between a chapter and a guild in the Spotify model. You could simply adopt a model as-is and say, look, I'm doing something, but we did something different because of our understanding of our own requirements. We were striving to push for technical excellence across the platform. By providing strong alignment across all the squads in the platform group, we were able to give each of them increased autonomy: to work better, to increase their overall impact and activity as a squad and eventually as a platform, and to boost our growth in infrastructure and human resources alike.

We started by reading this book. We wanted to understand the community's definition of a cloud-native infrastructure: what does it take for your scalable infrastructure to be called cloud-native? By the time we finished reading it, we looked like this. We were way off. But way off. Our infrastructure was hardly managed. We didn't employ any infrastructure-as-code practices. We didn't even employ CI/CD pipelines correctly. We had too much internal tooling that wasn't up to the task and, of course, didn't comply with any community standard.

Shocked as we were, we still had to shop for a solution, so this is exactly what we did. We went out into the community and shopped for a solution, and this was our supermarket: the CNCF landscape, which, as you all know, comprises thousands of technologies. Testing out that many takes time, but we did exactly that. We invested about four months in the process and created proofs of concept for many, many technologies that we assumed could help us. We needed to learn the benefits and, of course, the pitfalls of each of these technologies. We created POCs to compare Kubernetes and Nomad.
We created POCs to compare Argo and Flux. What's different between Terraform, Pulumi, and Crossplane? That takes a lot of time, but you learn so much. We also met with hundreds of vendors; if I could picture it all, this slide would show only the smallest part of it. My daily calendar looked like five meetings a day with different startups, and it was amazing, because we learned a lot about how they could help move us toward an actual cloud-native infrastructure. We actually partnered with several startups to help us uncover more of the unknowns that we still had about our architecture.

Once we completed this entire evaluation process, we set out the solution principles for ourselves as AppsFlyer engineering. The solution must be intuitive: familiar to everyone and easy to use. It must have a single source of truth; we didn't want the system to be fragmented anymore. It must be our one-stop shop for all the operational information, kept close to your code. It must be self-serve, boosting the ownership and autonomy of our developers. It must be declarative: no more new tools to learn, simply tell us what you need. Of course, we wanted to boost the transparency of our system and make it auditable, so it would be easier for us and for our users to understand what happened. And last but not least, we wanted the solution to be community-driven: no more reinventing the wheel. We wanted to leverage the existing tooling and frameworks that the community had already battle-tested.

If some or all of this sounds familiar, you've probably heard it before, because the desired properties of a GitOps management system are exactly these: declarative, versioned, pulled automatically, and continuously reconciled. So we figured that if we simply said we were going to adopt GitOps, we'd check off three of our items and be good to go; we'd only be left to do the rest, that's all. But how do you actually adopt GitOps? How do you go from the theory of it into real practice? How do you create a system that goes from the hello world of it to a large-scale system like AppsFlyer's?

We tried to work with each and every technology that we selected. We were dogfooding for almost six months just to understand how to operate each technology and framework. We tested several frameworks, several tools, and of course several flows, just to understand the pain points of using GitOps. And we landed on what we believe is an intuitive flow.

Let's take, for example, two developers at AppsFlyer. The first developer commits the code for the service and also declares the service's deployment plan. The second developer declares the infrastructure requirements for that service, and also declares the PagerDuty policies to monitor it. Both of them push to the same Git repository, and the declarative format that we selected is Terraform. Once everything is merged, a GitOps workflow kicks in, and with a mix of Kube controllers operated by Flux, we apply those changes, whether they go to our Kubernetes clusters, our cloud resources, or our SaaS integrations.
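To make that flow concrete, here is a minimal, hypothetical sketch of what such a single-repository Terraform declaration might look like. The resource names, image registry, and variable are invented for illustration; this is not AppsFlyer's actual code, just one plausible shape of the flow described above, assuming standard Terraform providers.

```hcl
# Hypothetical sketch: one Git repository declares the service's
# deployment, its cloud infrastructure, and its monitoring policy.
# Once merged, the Flux-operated GitOps workflow plans and applies it.

terraform {
  required_providers {
    kubernetes = { source = "hashicorp/kubernetes" }
    aws        = { source = "hashicorp/aws" }
    pagerduty  = { source = "PagerDuty/pagerduty" }
  }
}

variable "escalation_policy_id" {
  type        = string
  description = "Hypothetical PagerDuty escalation policy for the service."
}

# Developer 1: the service's deployment plan.
resource "kubernetes_deployment" "attribution_api" {
  metadata { name = "attribution-api" }
  spec {
    replicas = 3
    selector { match_labels = { app = "attribution-api" } }
    template {
      metadata { labels = { app = "attribution-api" } }
      spec {
        container {
          name  = "attribution-api"
          image = "registry.example.com/attribution-api:1.4.2"
        }
      }
    }
  }
}

# Developer 2: a cloud resource the service depends on.
resource "aws_s3_bucket" "events" {
  bucket = "attribution-api-events"
}

# Developer 2: the PagerDuty policy monitoring the service (a SaaS integration).
resource "pagerduty_service" "attribution_api" {
  name              = "attribution-api"
  escalation_policy = var.escalation_policy_id
}
```

The point of the sketch is that application, infrastructure, and SaaS integrations all live in one declarative format in one repository, so the same merge triggers the same reconciliation path for all three.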
We had now given our users great power, and with that comes great responsibility. We didn't want them to lose the autonomy we had just given them, but we also didn't want them to hurt themselves or others in the process. So we added Open Policy Agent to guard our flows. We can now employ policy as code to encode AppsFlyer's own policies into the process and, of course, leverage community policies to guard our GitOps workflow and make it a lot safer. If you're looking for more information about this solution, be sure to check out Ayel Ederot's presentation later today at 3:30 p.m., "GitOps, Everything We Sure Can", where she goes into a deep dive into the technology behind the solution.

So we now figured our solution was valid. It's intuitive. It has a single source of truth, Git in this case. It's self-serve and declarative with Terraform. It's auditable, again through Git. And nothing is homegrown anymore; everything uses community tools.

We started giving a lot of lectures internally inside the organization. We tried to educate the developers, both in the platform group and in R&D: what's different, how to use Terraform, how to use Kubernetes, how to read the output of a policy failure. What is the definition that we at AppsFlyer want for GitOps, as opposed to the community definition? We took a different approach here and said that GitOps, for us, is a management workflow for application and infrastructure alike, sitting in Git, with a focus on the developer experience using tools you're already familiar with; no new tools involved.

We explained to them how this solution and this workflow actually solved the pain points they had described to us earlier. The system is no longer fragmented; everything is in Git. You have increased autonomy, because you can now use several self-serve flows. The system is much less fragile; we can reconcile much more easily now and recreate whatever failed. And above all, the transparency of our system has improved tremendously; it's now very easy for our users to understand who made a change and what went wrong.

But of course, we got a very cold response, and to be honest, it's totally understandable. Our engineering teams had other things on their mind; they were pushing for the success of our business, and they preferred the known broken flow over working with a new integration. But we didn't back down. We used the constructs I explained earlier: for example, we used the Champions to better understand the fears and concerns around integrating with GitOps and how we could give developers more confidence to work with it. We took all the information we got from the Champions back to the topic and used that feedback to create unified messaging and communication back to our developer teams. And we kept on dogfooding the solution, continuously improving the integration and flows to better address the concerns our users raised.

So where are we today? In terms of managed resources in our cloud accounts, we went from 3% to 25% infrastructure-as-code coverage, keeping in mind that we're managing 250,000 cloud resources. In our new infrastructure we're now at 97% coverage, which is a huge improvement for us. We employ several self-serve flows, whether for Kafka, Vault, Druid, Consul, Kubernetes, or more. And this was the part our users adored, because they no longer needed to talk to us; they could do it on their own. They loved it.
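As an illustration of what one of those self-serve flows could look like in this model, here is a hedged, hypothetical sketch: a product team requests a Kafka topic purely by declaring it in Git, calling a platform-owned Terraform module. The module name, repository URL, and inputs are all invented for this example.

```hcl
# Hypothetical self-serve flow: a product team declares a Kafka topic
# in its own repository. The platform-owned module (name, URL, and
# inputs are illustrative assumptions) encapsulates the actual
# provisioning details, so the team never has to talk to the platform.
module "clicks_topic" {
  source = "git::https://example.com/platform/terraform-kafka-topic.git"

  name         = "attribution-clicks"
  partitions   = 12
  retention_ms = 604800000 # 7 days
  owning_team  = "attribution"
}
```

On merge, the same GitOps workflow reconciles the change, and the Open Policy Agent guardrails described earlier can reject a plan that violates organizational rules, for example a topic declared without an owning team, before anything is applied.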
Each time we enable a new self-serve flow, our engineering teams adopt it immediately. The overall stability of our system has improved tremendously; we still have a lot to improve, but we're getting there. Our provisioning time has improved as well: some systems took a few days or maybe a week to provision, and now it takes a few minutes, so we're getting there. And our overall contribution to open source as a platform team has improved tremendously, because we took the knowledge we gained from using these tools on a large-scale system like AppsFlyer and put it back into open source. We didn't keep it inside; we wanted to give back to the community everything we learned, and we're still doing it. We're in a continuous effort to support AppsFlyer's scale in the Terraform controller implementation we already contributed to Flux. We have an ongoing effort around our drift-detection solution; keep in mind that at a scale like this, most of the drift-detection solutions out there simply won't work, so we partnered with a startup that solved it for us. And we're still evaluating several solutions to continuously complete our GitOps flow.

Just a small recap of what we've been talking about for the past 20 minutes or so. We've been in this mode for about 18 months now. We started with a lot of unknowns: we didn't know what our users were doing, and we didn't know what our infrastructure actually was. We started by collecting the daily activities and understanding them, using the Champions model to get feedback. We used the topic to create a unified experience as a platform. We invested a lot in evaluating and POCing several technologies, and we kept evaluating the entire solution ourselves by dogfooding; we still are, and I think this is one of the key items for us as a platform group. Above all, we invested a lot in education and in open communication about any change we as a platform group made to boost the operability of our systems, and of course in learning back what we could modify. And as a platform group, we kept going: we wanted to help our developers and promote growth across the industry. Thank you very much. And I'd like to give special thanks to the platform group for enabling GitOps in AppsFlyer.

If you have any questions, now is the time. Any questions, anybody? Yeah, Max here.

In the beginning you said you were all working in survival mode, and I wonder, how did you transition to this new model while in survival mode, without hurting the stability of your existing infrastructure?

Okay, so yes, we were working in survival mode. We invested at least a quarter in not implementing any new features and not working at all with any of the development groups. I have to say that we got significant help from management and the organization in saying: look, they're not going to do anything other than push for stability.

Any other questions? All right, come on up.

Why did you prefer going with a Terraform controller and not Crossplane or alternatives?

Okay, it's actually covered in the later session, so I'm only going to answer briefly. The reason is that we learned that the community around Terraform is much, much larger. Crossplane might be the future, and I'm personally a fan of it, but it doesn't really stand up to the scale of AppsFlyer.
And for the entire reconciliation process, Crossplane would simply crash in our environment because of the large scale of our system.

Yeah, so when you mention the Terraform controller, how big of a problem is its performance for you at the moment? Have you had to optimize it?

This, again, is going to be covered in that session, so I'm not going to answer it completely. We have a different approach from the Weaveworks Terraform controller. We shared white papers with each other, so we have common knowledge, because we learned a lot, but we have different views, and we crashed our system several times with the Terraform controller. Our scale is much, much bigger than others'.

Awesome, anybody else?

How easy or hard was it to push this new platform culture across the organization?

When I first joined AppsFlyer, for most of the items I just described, people told me it couldn't be done. Then they selected me to lead this topic; I'm the leader of the whole topic. The success rate for the topic was extremely low at the beginning; no one actually wanted to adopt it. But with a lot of convincing and a lot of management support, it became very easy to push for the first results, and once we had the first results, it became very easy to push for the next milestones. The first result was covering a lot of infrastructure as code in a very simple flow. The next one was getting the complete group approval, the engineering approval, the CTO approval. We were dogfooding for six months, and then the CTO stood on stage in front of everyone: look what these guys are doing; all of you are going to do it. That was a great boost for us. Thank you.