What do you do when the toolset that is supposed to make you more efficient actually holds you back? You switch to a more efficient toolset, and that's what Onfido did as they scaled their microservices environment to over 250 microservices. Our next speakers, Tom and Pedro, specialize in building tooling that helps teams ship faster and more efficiently, thereby reducing the overall cycle time. They will talk through their experience of moving from Jenkins to GitLab CI/CD for their microservices environment, as well as the benefits and challenges they faced along the way. Over to you, Pedro and Tom.

My name is Tom, and my name is Pedro, and we're here to talk to you about our journey from Jenkins to GitLab CI for quality deployments at Onfido.

Our journey begins in 2019, when Onfido used a combination of Bitbucket and Jenkins to build and ship our software. At the time we had fewer than 70 developers and a fairly manageable number of repositories, but we knew this would begin growing quickly in the near future. We used a custom-built Jenkins pipeline, consisting of about 8,000 lines of Groovy code, to provide a simple and common build, test, and deploy process for all of our services with a minimum amount of boilerplate. An example of this is on the right: you can see it is a template that runs some tests using Docker Compose, deploys the resulting service to multiple clusters, and runs some Cucumber tests at the end. New repositories had to be configured by a merge request to a central Jenkins repository, which wasn't great to begin with.

We had other issues with Jenkins specifically: speed, stability, isolation, simplicity, flexibility, and user experience.

When it came to speed, Jenkins was not great. Our Jenkins setup was built on EC2 instances that required a large amount of time to spin up when load increased. You could wait up to 10 minutes for a node to become ready, which completely kills the quick build-and-test iteration loop that you get used to.

Stability was an issue. Jenkins would often lock up and require restarting, so it was very common, maybe even several times a day, for someone to go on Slack and say, "Hey DevOps, Jenkins is dead, please restart it." We even added several custom emojis for this specific case because it was so common. This is exacerbated by Jenkins' architecture, which involves a central master node that coordinates all builds on slaves. If the master fails, all builds completely halt, even if they could continue by themselves.

Our services are containerized, but building them was not isolated in Jenkins. A common thing to find was that the EC2 instances running the Jenkins slaves would run out of disk space, which meant future builds could not complete, which was incredibly frustrating. There were other issues as well that caused the nodes to become unavailable: a build would be scheduled, but the node wouldn't start the job, or it would fail for random reasons, and the DevOps team would have to manually prune those instances on request from developers.

Simplicity: Jenkins is not a simple technology at all. For example, pipelines written in Groovy can be restarted from any point, including the middle of executing a loop or even a function. It snapshots the current state of the pipeline after every step, which is super cool, and also super crazy, I think.
And the end result of this complexity is that it can be incredibly difficult to work out exactly what Jenkins is doing and why. We often found ourselves digging through huge, verbose Java projects, full of classes like JenkinsInterfaceAbstractFactoryProxyBean, to see why it wasn't running something simple like docker build with the correct arguments. These things should be very simple to do, but Jenkins made them quite hard and opaque.

Flexibility. This might be controversial to anyone who has worked a lot with Jenkins: Jenkins is endlessly flexible, and that's both a pro and a con. There's a fairly large ecosystem of good and bad plugins you can use, and you can author your own; likewise, you can build pipelines yourself with the power of a full programming language backed by a large standard library. However, that flexibility is hugely undermined by the complexity of the system, and in practice it locks out a huge number of developers who just want to get stuff done. In my opinion, true flexibility isn't when one or two people in an organization or team have the knowledge required to wield that flexibility effectively. It's when everyone in an entire engineering team can build and customize their own pipelines in a reasonable amount of time and with a low barrier to entry.

And last but not least, user experience. At the time, the Jenkins UI was terrible: viewing logs would sometimes freeze your browser, and in general the interface looked really dated. It wasn't well integrated with the rest of our tooling. It just wasn't a great user experience at all.

So we set out to solve these problems, and we settled on GitLab CI to do that. We configured our CI cluster to run on Kubernetes, which is the platform we use for all of our other workloads, and that gives us quick scaling, isolation, and reduced costs due to more efficient allocation of builds. We created a large set of includes that projects could use to quickly bootstrap a pipeline. These include jobs for building an image using a Docker-in-Docker daemon, so builds are isolated; testing that image in a separate stage, so you don't have to rebuild the image to run the tests; and then publishing that image to a central container registry prior to deployment. The included definitions would also set up stages for deploying to clusters, and centrally managing these deployment targets meant that adding a new cluster could be done in a single place, rather than having to go through and update every single service. More importantly, GitLab CI resulted in a high degree of team independence: teams could use the prebuilt pipeline, but they could also add other steps with the full flexibility that GitLab CI offers.

This is the end result of a pipeline for an example service. We build our image, we run some tests, we publish it, then we begin deploying to one of our many different clusters. The team working on this project has customized it with extra steps like load testing and contract testing. Some of these are includes that they've built, and some are service-specific jobs that they've simply added to their GitLab CI configuration.
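To make that concrete, here is a minimal sketch of what such an include-based bootstrap can look like. The project path, file names, and job names are illustrative assumptions, not Onfido's actual templates:

```yaml
# .gitlab-ci.yml in a service repository (hypothetical names throughout)
include:
  - project: platform/ci-templates        # central repository of shared includes
    file: /templates/docker-service.yml   # defines the build/test/publish/deploy jobs

variables:
  SERVICE_NAME: example-service

# Teams keep full flexibility: extra jobs sit alongside the shared ones,
# using the stages the shared template defines.
load-test:
  stage: test
  script:
    - ./scripts/run-load-tests.sh
```

Because the shared jobs live in one repository, adding a new deployment cluster is a single change to the central project rather than a merge request to every service.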
So when we started the migration, we had a lot of bespoke, legacy, untemplated Kubernetes resources that we wanted to migrate to a more standardized deployment. We started by building an internal tool called Kube, which applies Jinja2 templating to Kubernetes resources, with support for variables, conditionals, and custom functions. Jinja2 is a very common Python templating library; it's used in the Flask web framework, for example, and it's quite neat and simple to read. You can see in the code snippet that we have variables that are interpolated, such as the namespace. We have support for conditionals, so if you wanted to add a specific attribute or block in a specific cluster, you can do that. And we have custom functions. One of those is json_from, which reads a YAML file, runs templating and any interpolation that's required, and converts the result into a JSON object, which can then be safely embedded in the YAML.
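As a rough sketch, and assuming illustrative variable names (the exact json_from signature is also an assumption), a Kube input template might look something like this:

```yaml
# deploy/deployment.yml: a Jinja2-templated Kubernetes resource
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
  namespace: {{ namespace }}            # interpolated from scoped variables
spec:
  {% if cluster == "production" %}      # conditional, per deployment target
  replicas: 6
  {% else %}
  replicas: 2
  {% endif %}
  selector:
    matchLabels: {app: example-service}
  template:
    metadata:
      labels: {app: example-service}
    spec:
      containers:
        - name: example-service
          image: {{ registry }}/example-service:{{ image_tag }}
          env:
            - name: APP_CONFIG
              # custom function: templates config.yml and embeds the
              # result as a JSON string, which is safe inside YAML
              value: {{ json_from("config.yml") }}
```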
On top of this, we wanted to add some more consistency to our deployments. This includes Kubernetes-specific labels: it would be nice to look in Kubernetes and know where each deployment came from, the origin source, the repository, even the team that owned it. So we added the ability to define transformers inside Kube to mutate the resources it is templating. Kube would run the templating, create the Kubernetes resources, and then apply the transformers: you can define simple functions that operate on specific entity types. In the example, we added a simple function that takes a project ID, operates on each deployment, and adds a project ID label to it. We added more complex ones, like adding specific tracing labels. And we found this to be quite an effective way of ensuring consistency, despite us having a lot of ad hoc Kubernetes resources.

We also added support for custom linting rules in Kube to catch common errors. The one on the screen checks that a backoff limit is applied to every single job before it's deployed. There are others for common patterns and common errors that we wanted to catch. Doing this in simple Python code allowed us to be quite flexible. At the start of the pipeline, Kube would lint the resources after they were generated and display an error message, defined in the docstring of the Python function, describing what the issue was and how to fix it: in this case, adding a backoff limit, which is important because not having one is a bad thing.

The variables that are interpolated come, in part, from scoped environment variables. We have a set of these configured for each cluster, each deployment target, with information about the namespace, the AWS account ID, and other internal values that might be interpolated. All these resources are placed in a common directory inside the root of the repository, and the tool combines them into a single YAML file, lints it, validates it, mutates it, and then deploys it to Kubernetes.

The end result of this is that we have 2.12 million builds across 300,000 pipelines. We have over seven years of CI build time in total. We've had zero outages across all of this, despite tripling the engineering team size and hugely increasing the number of projects, pipelines, and jobs. This is amazing. However, plot twist: is this the new Jenkins?

The problem with creating a new pipeline that is meant to replace an existing pipeline exactly, step by step, is that it's going to inherit a few of its issues. Complex Groovy led to complex YAML and Bash. And given that the old pipeline in Jenkins was extremely rigid and difficult to manage, it lacked things like error management, which ended up being inherited by the new pipeline in the process. If something went off the expected path, for example if a deploy failed, or if a build had problems, or if a task failed, the pipeline provided at best cryptic error messages that were not exactly easy to digest. You can see that in the example on the screen here: the deployment fails in the middle of a rollout, and after several other steps that appear with a lot of green that looks very positive, it even shows a random success message at the bottom. This would result in several messages on the DevOps support channel asking what had happened: the logs seemed good, but the pipeline had failed, or it had failed to deploy somehow. And even though it was much better than Jenkins, with no need to restart anything every two days or every half day, this was still complex and led to a lot of false positives in the process of configuring and using the pipelines.

The other large problem was that we had a lot of duplicated resources. YAML was spread across all the repos of all the projects we had, and the teams that owned those repos weren't exactly the people with the expertise and the interest to improve and change that YAML. As a result, it was sometimes a burden to have to go to every project and change things. It was in fact so complicated that we ended up creating a tool to automatically open a merge request in every single repo for each of the changes we had to make, and we had to use this quite a few times. Even then, this meant that a change that could perhaps be done in two or three hours in a centralized setup took weeks with a decentralized approach like this.

With the start of the year and the start of the quarter, we had a new goal across engineering to improve our stability. One of the things we wanted to do to improve stability was to increase the pace of deploys without increasing the pace of errors. In a way, our goal was to deploy with confidence. How did we go about doing that? How do you improve confidence in deploying? Well, we looked into three specific approaches.

First, we decided to reduce the impact of failure. We added canaries by default to all projects, sending a small portion of traffic to the new version before going fully live with a new release. We also decided that automatically rolling back a new release when the error rate is high was a good approach, so we implemented a way to monitor the error rate of deployments and, based on the percentage of traces that had errors, immediately roll back if it went over a threshold. And last, we decided to move the canary configuration away from the developers' repos and into a centralized setup, owned by a dedicated team with the knowledge, the interest, and the information required, to make changing the YAML more scalable.

For the implementation of the canaries and the automated rollbacks, we went with Argo Rollouts. Argo Rollouts is a specific project within the Argo family of projects, and it's incredibly useful. It provides an easy and familiar way to implement blue-green and canary deployments through its custom resource definition, the Rollout, which is very similar to the existing Kubernetes resource, the Deployment. The advantage of this is that even though it's almost exactly the same, the Rollout's strategy additionally allows us to specify a blue-green or canary rollout. And inside that rollout strategy, we can also add analysis templates that allow us to query Kubernetes or some other metrics provider and roll back automatically if there is a high rate of failure.
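Here is a minimal sketch of what that looks like with Argo Rollouts. The weights, pause durations, and the Prometheus-based analysis are illustrative assumptions; the talk doesn't specify which metrics provider is queried for the trace error rate:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-service
spec:
  replicas: 5
  selector:
    matchLabels: {app: example-service}
  template:                        # identical to a Deployment's pod template
    metadata:
      labels: {app: example-service}
    spec:
      containers:
        - name: example-service
          image: registry.example.com/example-service:v2
  strategy:
    canary:
      steps:
        - setWeight: 10            # send 10% of traffic to the new version
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: error-rate   # run the check defined below
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1              # one failed measurement aborts the rollout
      successCondition: result[0] < 0.05
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # illustrative provider
          query: |
            sum(rate(http_requests_total{app="example-service",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{app="example-service"}[5m]))
```

When the analysis fails, the Rollout controller aborts the update and shifts traffic back to the stable version, which is exactly the automatic rollback behaviour described above.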
And instead of having the YAML in each repository, we used Helm to create a centralized template that abstracts all of that complexity away from the developers. Instead of keeping all the YAML in their repository, developers have a set of values files in which they describe, at a high level, what the template will then implement. Although we couldn't use GitLab's Auto DevOps directly, we took heavy inspiration from it for the implementation of this solution.

One quick thing I want to add, which was incredibly useful for us, is the use of the helm-unittest plugin, which gives us confidence that the rather complex template we were developing was correct and did not introduce bugs with every new release. In a way, we can now deploy the pipeline itself with confidence. In a nutshell, helm-unittest allows us to create suites of tests: we specify parameters for the values and then assert statements about the resulting generated templates. We use this extensively, both to test behavior and to prevent regressions in either the implementation or the interface.

So, as a result, this is what was required before, with Kube, to deploy. We can see several things here. It might look a bit scary, but this is just Kubernetes YAML: we specify a deployment with a set of parameters, a service, a horizontal pod autoscaler, an ingress to expose the service to the outside, and a migration job. Fairly standard resources, albeit a bit verbose given the nature of Kubernetes. The resulting new pipeline can do the same thing in much less code. Here we can see the values YAML used to generate the deployment, service, ingress, and horizontal pod autoscaler. It easily reduces the amount of code required to a third or a fourth of the total size for normal projects. It abstracts all of these implementation details away from the developers and allows them to be centralized and updated at once for every project if need be.

In addition to that, the .gitlab-ci.yml stays pretty much the same. It removes the need for some variables that were required before, as they are now automatically inferred from the values files specified by the developers. The resulting pipeline is also very similar: it keeps the same process of running the unit tests, running the builds, and publishing the Docker image, as well as deploying to the various environments. The main differences are that the deploy jobs can now roll back automatically, thanks to Argo, if there is a high error rate, and in addition we added a manual rollback step next to the deploy step itself, in case there is any other unexpected situation, to make rolling back quick and immediate for the developers.
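For illustration, a values file for the centralized chart might look something like this. The key names are assumptions for the sketch, not Onfido's actual chart interface:

```yaml
# values.yaml: a high-level description of the service; the centralized
# Helm template expands this into the Rollout, Service, Ingress,
# HorizontalPodAutoscaler, and migration Job
image:
  repository: registry.example.com/example-service
service:
  port: 8080
ingress:
  enabled: true
  host: example-service.example.com
autoscaling:
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
migration:
  enabled: true
  command: ["./manage.py", "migrate"]
canary:
  enabled: true
  maxErrorRate: 0.05    # threshold fed into the rollback analysis
```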
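And a helm-unittest suite over that chart could look roughly like this, assuming the chart renders the Rollout from templates/rollout.yaml:

```yaml
# tests/rollout_test.yaml: a minimal helm-unittest sketch
suite: rollout rendering
templates:
  - rollout.yaml
tests:
  - it: renders a canary strategy when canary is enabled
    set:
      canary.enabled: true
    asserts:
      - isKind:
          of: Rollout
      - isNotNull:
          path: spec.strategy.canary
  - it: omits the canary strategy when canary is disabled
    set:
      canary.enabled: false
    asserts:
      - isNull:
          path: spec.strategy.canary
```

Running helm unittest against the chart on every change to the central template is what makes it possible to deploy the pipeline itself with confidence.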
In a nutshell, there were several benefits that we saw in our migration from Jenkins to GitLab. GitLab proved to be much better than Jenkins in many respects. It managed to scale with the number of jobs and the number of CI pipelines we ran in a way that Jenkins never did. We now run many times more jobs and hours of pipelines than we did before with Jenkins, and it doesn't fall over in the process. It is just much easier and more stable to use. And at the same time, it managed to scale in terms of complexity, as more and more projects were able to implement their own unique, specific situations on top of the standard pipeline we provided. In a way, by being simpler to use, it opened the door for more developers to use it, touch it, and modify it, and it ended up being more flexible than Jenkins ever could be for those people.

It's not all roses, though: there were some challenges along the way. One of the questions we asked ourselves to begin with was: would Auto DevOps be a solution to this? It looks nice, and it seems somewhat similar: an opinionated framework, using Helm, that kind of just deploys things. It's promising, but it is also very opinionated. That's probably a good thing if you are starting completely from scratch with fresh infrastructure on GitLab and you are okay with adopting everything Auto DevOps gives you. However, if you're migrating from an existing infrastructure or an existing deployment process, it's not simple: you have to adapt your projects to work with it, and it just looked like a huge amount of work. On top of that, there were things missing, like deployments to different clusters and the specific rollout patterns we wanted. We run multiple development, production, and staging clusters for redundancy, and it wasn't entirely clear how we would fit those patterns into Auto DevOps or how we would customize it down the line.

There were some surprising things as well. At the very start of this, the lack of scoped environment variables at the group level was quite an annoying limitation for us. We used scoped environment variables to configure cluster-specific values, for example namespaces, for interpolation into Kubernetes resources, and these have to be defined at the project level. Unfortunately, we had to build some tooling that takes a set of these variables and their scopes and uses the GitLab API to apply them to every single new project created under the Onfido group. This could be implemented as a pipeline, and it's okay to build, but it's an annoyance we had to overcome. We also had to set up webhooks so that, when new projects are created, this pipeline is triggered to configure the variables on them. If it failed for whatever reason, the deployment steps just wouldn't work, because those scoped environment variables wouldn't exist, which is not a great experience.
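As a rough sketch of that variable-sync step (the job, token, and values here are hypothetical, though the endpoint is GitLab's real project-level CI variables API):

```yaml
# Job in the hypothetical variable-sync pipeline, triggered by a
# project-creation webhook; NEW_PROJECT_ID and API_TOKEN are assumed
# to be provided by the trigger and CI settings.
apply-scoped-variables:
  image: curlimages/curl:latest
  script:
    - |
      for scope in staging production; do
        curl --fail --request POST \
          --header "PRIVATE-TOKEN: ${API_TOKEN}" \
          --data "key=NAMESPACE" \
          --data "value=example-${scope}" \
          --data "environment_scope=${scope}" \
          "https://gitlab.example.com/api/v4/projects/${NEW_PROJECT_ID}/variables"
      done
```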
On the subject of webhooks, things like deployment event webhooks didn't exist at the time. This has changed recently, I think, but we wanted to put an event into Datadog or Splunk or another monitoring system whenever a GitLab deployment succeeded or failed. This counts for Terraform, Kubernetes, copying something to S3: whatever the project considers a deployment, we would like to have it consistently monitored and logged in our tooling. To get around the lack of webhook support for this, we had to embed various tools inside each of the images that do deployments, for Terraform or anything else, which would then call some external services with secrets and credentials that are configured. This is not ideal. What we really want is a webhook sink that just receives "hey, this project deployed this" and can then put it in Splunk, put it in Datadog, copy it to wherever it wants.

Bash itself is powerful, but it's complex, and there's definitely a lack of reusability for commands, which is kind of annoying. At the start, when we moved from Jenkins to GitLab CI, we pretty much took the existing Jenkins Groovy code and translated it into Bash, including the loops, et cetera. A better way of doing this, for us, would have been to be more modular: maybe use Bash functions, or write specific tools in other languages and compose them together with GitLab CI.

The lack of composability was surprising as well. CI blocks like before_script, after_script, and script, any kind of array, aren't merged when overridden. So if you extend the build-image job and you want to add one statement to its before_script, you have to copy the entire contents of the before_script into the block you're overriding and then add your specific customization on top. This has since changed: with the !reference tag you can now pull those blocks in, so you don't need to do this. But that change is recent; for a long time this wasn't possible, and it led to a few copy-and-paste incidents.

Other than that, there were many small paper cuts along the way. GitLab is a big platform that's growing a lot and evolving. There were some bugs around the way we include things, not small issues, but overall not too many, which was refreshing to see compared to the issues we had with Jenkins plugins and other CI-related things in the past.

So, thank you. You can contact us at the links on screen. We work at Onfido, and Onfido is hiring, so if you are interested in coming to work for us, we have a large number of very interesting challenges around the document identity space and a bunch of other spaces we're launching into. Check out our careers page at the link on screen and come work for us. Come experience our beautiful pipeline. Cool. Thank you. Thank you.