Hi everyone, good morning. I'd like to tell you a story about how we open sourced our product and fixed our CI in the process. But it's really a story about how we moved from VMs to containers and how we improved developer experience.

Something about me: I'm Michael. I have worked with open source my entire career, and I've been working full-time on open source for about five years. That also means I contribute to projects whenever I need to; anything I need to touch, I feel free to contribute to. I also have my own and everyone else's development productivity on my mind. That means I cannot touch a project if it doesn't have a CI; I have to build one first. And I have worked on 3scale for about seven years.

So what is 3scale? It is an open source API management product. Almost every one of you has probably used some kind of API management; you have probably used API keys to access some service. But API management is not just about API keys. It's a lot about customer relationship management. You are doing things like billing, a developer portal, workflows, application approvals, email notifications, analytics. There is a lot of UI. It adds humans to the mix; it's not just traffic between computers and API keys. What that means is that we have a really big end-to-end UI test suite, and it's really slow.

3scale used to be closed-source software for about ten years before it was acquired by Red Hat three years ago. That means we have a lot of skeletons in our closet and a lot of places where we cut corners. Open sourcing took time; it took about two years to get everything ready. And one of the things we had to do was actually fix our CI pipeline. By the end of last year we finally did it.

Any project needs a CI, but an open source one even more so, because it needs to scale. If you have a small in-house team of developers, it's probably a few people. But if you are opening up to the public, it's way more; there can be dozens, even hundreds of people contributing to your project. They need a CI, and they cannot be waiting half a day for your CI build queue to finish. A CI is also a great place to enforce rules, like that the tests should pass or that a style guide should be followed. Our CI was definitely not good for external contributors, because it was very much oriented around us, and it was pretty slow.

Something about our old setup: we were running on Amazon, using Jenkins, and autoscaling EC2 instances with a Jenkins plugin. Everything was configured in Git, including all the Jenkins configuration, and we used Terraform to provision it. We were able to spin up a staging environment pretty quickly. But it still didn't scale to external contributors, because it was our internal infrastructure.

We had a team of about five people working on the project at a time. We were doing 10-20 builds a day, two to three pull requests, and we had no automation around opening pull requests, like automatic dependency upgrades. And still, sometimes developers had to wait for their builds because they were queued up.

Our build time, in the optimistic case, the warm build, was about 15 minutes. A warm build is one that runs on the same machine, same branch, same dependencies, and basically reuses the cache that is already on the machine. If we add more machines, a build can end up on a different machine; that was causing cache issues, and the cache locality wasn't as good as we would have wanted.
The build took about 15 minutes, but spread over something like seven or eight machines, which is pretty heavy. It was blocking about 11 CPU hours, using about 45 cores and 9 gigabytes of RAM. You might already see some red flags there; that's not an ideal CI pipeline, right? But let me tell you what our motivation to fix this was. Maybe you will recognize some of the problems too.

We had random test failures. Tests were executed in alphabetical order, and adding a new test file would change that order. So just adding a new test could make some unrelated tests fail, because they were executed in a different order and they had dependencies between them. Or they were failing because some state was not cleaned up, or something like that. It was hard to know whether a failure you were experiencing was a real failure or a random one, because it was happening so often. Developers were just rebuilding, rebuilding, rebuilding, without actually checking whether the failure was caused by them or was just a random one. And rerunning the whole pipeline took time and money. If you are using 45 cores for one build and you have to rebuild it five times a day, it costs money.

And that was causing another issue: build queues. One build was using several EC2 machines. Our team is based in Barcelona, and they like to go for lunch together. So they go for lunch together, they come back, and they want to get their work done. They all push their commits, and that queues up all the builds. It takes time to spin up the new machines, and spinning up new machines is not free. And then a new machine has no cache, so it needs to pull all the dependencies, and the builds could take 45 minutes. If there was a random failure, they would have to re-queue. So developers had to wait and wait and wait.

Another thing was the maintenance of Jenkins. It's just another production system, and your CI is a critical part of your infrastructure. If you have required checks on GitHub, GitLab, or anywhere else that requires your CI to pass, you are not going to be able to merge a pull request, or a hotfix, or anything else, if your CI is not working. If your tests are failing randomly, you are not able to merge a hotfix for something that broke in production five minutes ago; you are going to have to rebuild, or trust that it's safe and skip the checks, right? It can prevent you from deploying. We have checks so that we cannot deploy a Git revision that didn't pass CI, so you have to wait for the CI.

It also needs a staging environment. If you are upgrading it, you cannot afford to break it for your developers, so you need staging to verify that all the changes you are making are actually going to work. It needs security updates; that's a lot of work. It needs monitoring; if it's down, your team needs to be on call. And one thing about Jenkins: there is an LTS release, so you can run a more stable version, but the plugins don't have one. Even though you are on long-term-support Jenkins that you don't have to upgrade often, the plugins don't follow the same strategy. So if you are using any plugins, all bets are off. We were running on a master revision that fixed some weird issues with Amazon, and similar crazy stuff.

Last but not least were the costs. We were paying about two and a half thousand euros a month for our Jenkins infrastructure, just in Amazon.
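To make that test-ordering problem concrete before moving on, here is a hypothetical minitest sketch of the kind of coupling we kept finding; the registry and the test names are made up for illustration:

```ruby
require "minitest/autorun"

# A hypothetical global registry standing in for shared state:
# a database table, a file on disk, a class-level variable...
ACCOUNTS = []

class TestAccountCreation < Minitest::Test
  def test_creates_account
    ACCOUNTS << "alice"
    assert_includes ACCOUNTS, "alice"
    # Bug: the test never cleans ACCOUNTS up after itself.
  end
end

class TestAccountListing < Minitest::Test
  def test_list_starts_empty
    # Passes when run alone; fails whenever the leaking test above
    # happens to run first in the same process.
    assert_empty ACCOUNTS
  end
end
```

Run alone, `test_list_starts_empty` passes; run after the leaking test, it fails. That is exactly how adding one new, unrelated test file could shift the alphabetical order and break tests that had nothing to do with the change.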
And that 2,500 euros is not counting the maintenance cost and the developer slowdown. The total cost of our CI pipeline was not just the Amazon bill. It was also the maintenance: about four days per month of one person spinning up staging, verifying everything works, checking that all the plugins are okay, and all that. And we have to count the development team's slowdown. If developers are waiting for CI, they are going to juggle things, right? They are going to move to a different pull request, they are going to start something else, and maybe they are going to push more changes and queue up more builds, actually making the situation worse. But humans are not great at multitasking, so they are going to forget what they were doing before. If you ask them an hour later, "Is this hotfix done?", the answer is, "Oh, I forgot. I was waiting for CI and then I started something else." Those costs are probably worse than the actual Amazon costs.

So we started to look at alternatives and to choose something that would scale not only for our internal team, but also externally to more developers and more contributors. Our main concern was that everyone needs equal access to the build status, build logs, and artifacts. We cannot run it on our internal infrastructure, where external contributors wouldn't have access. We had options inside Red Hat, where we could run it on various internal clusters, but then nobody external could actually get access to it. And we don't want to make people wait: if a CI run starts, it should run immediately; it shouldn't wait for the 20 people that opened their pull requests before you. We wanted to scale horizontally with the people.

We had a winner, because we were already using CircleCI for other open source projects, so it was quite an easy choice. I have to say we are in no way affiliated with CircleCI; we are just happy customers. It worked great for us for open source, so we decided this was a good path to take. CircleCI already provides a free plan for open source. They also have plans for bigger open source projects, essentially infinitely scaling plans; there is a button on the open source pricing page where you can contact them and get in touch.

Beyond the problems I already mentioned, like providing public build information and artifacts, and not having to maintain a production system, a staging system, or an on-call rotation for it, there are some cool features that convinced us this was the right choice. Before, we were splitting the test suite rather naively, just splitting everything into chunks by name. CircleCI allows us to split by timings. Not all tests take the same time: some of our tests take a few minutes, some take a few seconds. Splitting by timings allows us to balance tests across several machines by time. But to do that, you need to know how long they take from previous runs. This is achieved by recording the test runs, saving the JUnit XML report, which is pretty much standard nowadays. CircleCI records those results. That gives you historical test performance; that gives you the ability to split tests into chunks by their timings from previous runs; it gives you the most-failing tests. It also gives you the ability to see exactly which test failed in your CI, so you don't have to scroll through the logs and figure out what went wrong. You just see: this failed, on this line; this was expected, this was wrong.
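As an illustration, here is roughly what that wiring looks like in a CircleCI config. This is a minimal sketch, not our actual configuration; it assumes an RSpec suite with the rspec_junit_formatter gem installed, and the job name and paths are made up:

```yaml
jobs:
  rspec:
    parallelism: 8   # fan this job out over 8 containers
    steps:
      - run:
          name: Run this container's share of the specs
          command: |
            # Split the spec files across containers, weighted by the
            # timing data recorded from previous runs.
            TESTS=$(circleci tests glob "spec/**/*_spec.rb" \
              | circleci tests split --split-by=timings)
            bundle exec rspec --format progress \
              --format RspecJunitFormatter --out tmp/junit/rspec.xml \
              ${TESTS}
      # Storing the JUnit XML is what feeds the timing data (and the
      # per-test failure view) back to CircleCI for the next run.
      - store_test_results:
          path: tmp/junit
```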
It also gives us workflows that can be resumed from any point. If we have a really complicated workflow and only the last step fails, we can rerun from failed. Some workflows are more time-consuming than others: we run against MySQL, Oracle, and PostgreSQL, because our product supports all of them, and some of those runs take longer. Each of those points can be rerun individually. If we have a pipeline that takes one hour and only the last 15 minutes fail, we can rerun just those last 15 minutes. It actually increases focus on what is actually failing: you don't have to rerun the whole test suite. If you are debugging a failure, just rerun the last step; you don't have to wait 45 minutes before you can actually take a look at what is failing.

If you have a CI, it's really important to be able to get access to it. If it's a black box, you cannot really figure out what's wrong. SSH is pretty much the holy grail of getting access anywhere, right? You can run shell commands remotely. You can forward local ports to the remote machine: say you are running some service locally, you can expose it to the tests and access it from there. Or you can go the other way around and forward a remote service from the CI to your local machine and access it locally. That includes, for example, the X server. If you are running a browser test suite in your CI, you can forward the X server to your local machine and watch the actual browser locally, and see what the CI is clicking on, without running the suite locally. No more guessing what is happening there; you just SSH in and watch it as it happens. Or after a failure, you can SSH in and rerun the tests.

So how does our CI pipeline look? It is a workflow made of several smaller jobs. Some jobs need fewer dependencies and can start earlier, like the unit tests or RSpec; they don't need, for example, the JavaScript assets. We install dependencies for all of the jobs just once: we install the Ruby and JavaScript dependencies early and then share them with the other jobs. It blocks fewer resources because the jobs are more granular; we don't need 40 containers for installing dependencies, we just need one.

Each job can have a different concurrency level; the multipliers you see are the concurrency level of each task. Because we have really big, fat tests, the Cucumber tests, the UI tests, we run them with 40× concurrency. And of course, we don't want to pay for 40× concurrency while precompiling something or installing dependencies, because that doesn't make sense.

That allows us to get pretty good savings: we are at about half the cost of our old setup. That was caused by several things. First, it is priced per minute, so we don't have idle machines running for the whole hour. The workflow allows different levels of parallelism, so we don't have to use 16-core machines to run everything; we can have smaller machines and smaller containers for smaller workloads. It has way better caching, because the cache can be shared between different machines and containers. And it can rerun failures pretty nicely. And this is half the cost even after we doubled the team and added automatic dependency upgrades, which take another half of the resources.
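To illustrate the shape of that fan-out, here is a trimmed, hypothetical CircleCI sketch; the image, job names, and parallelism values are illustrative, and the real pipeline also has dependency-install and unit-test jobs that are omitted here:

```yaml
version: 2.1

executors:
  ruby:
    docker:
      - image: cimg/ruby:3.1-browsers   # hypothetical image with browsers for the UI suite

jobs:
  assets:
    executor: ruby
    parallelism: 1                      # pay for asset precompilation only once
    steps:
      - checkout
      - run: bundle exec rake assets:precompile
      - persist_to_workspace:           # share the result instead of rebuilding it 40 times
          root: .
          paths: [public/assets]
  cucumber:
    executor: ruby
    parallelism: 40                     # only the big UI suite gets the high fan-out
    steps:
      - checkout
      - attach_workspace:               # reuse the precompiled assets from the assets job
          at: .
      - run: bundle exec cucumber

workflows:
  build:
    jobs:
      - assets
      - cucumber:
          requires: [assets]
```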
Back to those automatic dependency upgrades: for example, if there is a Node.js dependency with a security issue, it will automatically open a pull request and merge it if it passes all the tests. And those happen several times a day, because we have hundreds of dependencies. So the biggest user of our CI is actually these automatic dependency upgrades; they use about 50% of our pipeline.

How did we get there? Getting there was quite straightforward. Not easy, but straightforward. We just had to run our tests a thousand times. As we heard before, this was kind of a bug hunt: run the suite a thousand times and see what actually fails. And then we had to fix everything we found.

The major issue was flaky tests. Flaky tests are tests that can fail or pass for the same configuration, and they can fail or pass for various reasons. It could be some shared state, in a database or a file, or some concurrency in the test. Or it could be network or timing issues. And those are probably the ones that hurt horizontal scaling the most.

One kind of flaky test is the kind with dirty state. Some tests require external state, like records in a database or files, and they fail to clean up afterwards. When the next test executes, that leftover state can cause failures, because the test isn't expecting it. And it happens only when they are executed together, not standalone. So you get a test failing in CI, you run it locally, and it works, because you are running just that one test. If you ran it together with the other test, the one actually leaking into the database, it would fail. Sometimes it's not even necessary to clean up: if you have a database per test, or you create a new workspace or a database transaction per test, you can just garbage collect it later.

Another kind is reliance on some other test. If you have a test that sets a global variable, stubs some methods, or does something else globally to the language runtime, it is effectively preparing something for other tests. If the tests are sorted, for example, alphabetically, some of them end up silently expecting that another test does their setup. If you execute them in isolation, they might fail because nothing prepared them. So when you parallelize, or when new tests are added, the tests that depend on something else can be pushed to a different container, not get their setup, and fail. Or they fail because they are executed together with another test doing the same thing.

So how do you ensure your test suite is healthy? Definitely start by randomizing the order of your test suite. By executing it in random order and running it many times, like a hundred times overnight, you are going to find out that there are, in fact, issues. But before just blindly turning on random order, you need to make sure you can actually reproduce any given random order, because the randomness still has to be deterministic. That is usually done by setting the seed of the random number generator; typically, when you run the test suite in random order, it prints the random seed. So make sure you can actually get that number from your test suite, verify it works, and confirm you can rerun the tests in the same order again when you see a failure.
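In RSpec, for instance, that is just a couple of lines; a minimal sketch (minitest and other runners have equivalent switches):

```ruby
# spec/spec_helper.rb
RSpec.configure do |config|
  # Run example groups and examples in a random order on every run...
  config.order = :random

  # ...and feed the same seed into Ruby's own RNG, so any randomness
  # used inside the tests is reproducible together with the ordering.
  Kernel.srand config.seed
end
```

Each run then prints the seed it used, something like `Randomized with seed 61773`, and `bundle exec rspec --seed 61773` replays exactly the same order.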
Beyond the seed, it's usually done with some plain-text formatter from your test suite that shows you the order the tests are executed in, so you can verify you can execute them in exactly the same order again. And you need to record test failures. By using the JUnit test recorder, for example, you know which tests were executed, in which order, and on which containers, and you can then try to reproduce it.

So how do you actually reproduce it? When you get a failure, how do you figure out which tests were executed together? This approach works only for deterministic failures: if those tests fail every time they are executed together in the same order, you can figure it out. If the failure is caused by concurrency, timing issues, or the network, it's way harder. Usually, just start by taking the seed from the CI and rerunning, locally, the whole batch of tests that ran in that one container. You can also start by SSHing into the container in the CI and rerunning it there, on the same machine, in the same environment. If it fails: bingo. That's the good case, when you can actually make it fail. Then try it locally and see if it fails there. If it doesn't, okay, you still have the CI environment to SSH into and try it there. And then usually just remove half of the tests you are executing and see if it still fails, bisecting by hand. If it's not failing, try the other half, try different chunks, and keep minimizing until you find the two or three tests that actually fail together. This applies to the tests that depend on each other or don't clean up their state. It also helps if you have, for example, SSH access to the database: you can watch the database log, or, if you are using something like Redis that doesn't have real transactions, you can use MONITOR to see what the test is doing.

One of the things that really helped us save costs was dependency caching. Installing dependencies takes time. Our project depends on Ruby and on Node.js, and just installing the Ruby dependencies would take about 30 minutes. Doing that on every build is definitely not a good thing.

There are several kinds of dependencies. The most common case is packages, your language dependencies: in Ruby they are gems, in Python it would be eggs, I think, in Node.js they are just packages. You definitely need to use transitive dependency locking. That means all the dependencies you are using, even the dependencies of your dependencies, are locked in a file at specific versions. You take that file and it becomes the canonical representation of your dependencies. Then you can cache easily: if you digest that file, with MD5 for example, you get a nice cache key, and you know that all those dependencies are in that cache archive. You also know that if any of those dependencies, or dependencies of your dependencies, change, you are going to install the new ones. That allows you to use layers: for example, you can have a cache for the master branch, then one for your branch, and then one for something else. If you change just one dependency, you can take the cache of the master branch and install just the one missing dependency instead of all of them, spending one minute instead of 30. And these caches can be shared between builds.
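In CircleCI config terms, that layered, lockfile-keyed cache looks roughly like this; a sketch assuming Bundler and a Gemfile.lock, with the cache key prefix made up:

```yaml
steps:
  - restore_cache:
      keys:
        # Exact hit: the lockfile digest is the cache key.
        - gems-v1-{{ checksum "Gemfile.lock" }}
        # Fallback layer: the newest cache with this prefix (e.g. from
        # master), so only the few missing gems get installed on top.
        - gems-v1-
  - run: bundle install --path vendor/bundle --jobs 4
  - save_cache:
      key: gems-v1-{{ checksum "Gemfile.lock" }}
      paths:
        - vendor/bundle
```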
One more thing about installing dependencies: Ruby, for example, allows you to parallelize the installation, so you are compiling all of them across all cores and not just serially. That's pretty helpful.

There are other kinds of dependencies too: internal ones. For example, if you are precompiling assets, minifying images, and so on, there is no need to do it in every part of your pipeline. You can do it in just one part and then share it with the others. As you have seen, we run with 40× parallelization. If that step added one minute of precompiling images, we would be paying for 40 minutes of precompiling images. If we move that step to a one-container stage before it, it still takes the same time from start to finish, right? It takes one minute, and then seven minutes to run the tests, where otherwise it would take eight minutes to run the tests. But we would be paying for one times one minute, not 40 times one minute. That saves a lot of money.

So the takeaway is: love your tests, run them a lot, and treat them as a production codebase. If you love them, they will love you back and give you good feedback, and everyone will be happy.

That's it. Everything is open source, so we are making these improvements in the open, and you can check our pull requests to see how we optimize and make the tests faster, for example around installing Node.js dependencies and all that. That's it. Thank you.