Hi, my name's Emil. Today I'm going to be talking about testing Rails at scale. I'm a production engineer at Shopify; I work on the production pipeline, on performance, and on DNS. Shopify is an e-commerce platform that allows merchants to set up online stores and sell their products on the internet. To give you a little background, Shopify has over 240,000 merchants. Over the lifespan of the company, we've processed $14 billion in sales. In any given month, we do about 300 million uniques, and we have over a thousand employees.

When you're testing Rails at scale, you typically use a CI system, and I want to make sure we're all on the same page about how I think about CI systems. I like to think of CI systems as having two components: the scheduler and the compute. The scheduler is the component that decides when a build needs to be kicked off. Typically that's a webhook that comes in from something like GitHub. It orchestrates the work, so it decides what scripts need to run where. In contrast, the compute is where the code is actually run, where the tests are actually run. And the compute is everything that touches the compute as well: it's not just the machine, but orchestrating the machine, making sure it's there, getting the code onto the machine, and everything else involved.

If you look at the market of CI systems, you typically see two types. You have the managed providers. These are closed, multi-tenant systems: they handle both the compute and the scheduling for you, and you just give them the keys to your code base. Some examples are Circle CI, Codeship, or hosted Travis CI. In contrast, you also have unmanaged providers. These are systems where you host both the scheduling and the compute in your own infrastructure. It's an open system: you have access to the code base, so you can make whatever changes you'd like.
To give you an idea, some of these are Jenkins, Travis CI, or Strider.

Today, Shopify boots up over 50,000 containers in a single day of testing. During that time, we build Shopify 700 times. For every build, we run 42,000 tests, and the whole process takes about five minutes. But this wasn't always the case. Around winter last year, Shopify's build times were close to 20 minutes. We experienced serious flakiness issues, not just from code health, but also from the provider we were on. We were that provider's biggest customer, and they were running into capacity issues, so we would hit problems like out-of-memory errors. The provider we were using was also expensive, and not just in the dollar amount: with a hosted provider you typically pay by the month, but your typical workload is five days a week, eight to 12 hours a day. The rest of the time, you're not using that compute.

So we set out on a journey to solve this problem. We were given the directive to lower our build times to five minutes. At this point, due to the level of flakiness and the long build times, you would have to rebuild your build multiple times — two or three times — even though the suite should be green, before you got a green build and got to deploy. The goal was also to maintain the current budget.

We looked around on the market and found an Australian CI provider by the name of BuildKite. The interesting part about BuildKite is that they're a hosted provider, but they only provide the scheduling component; you have to bring your own compute to the service. The reason that's very valuable is that for 99% of use cases, the scheduling component is the same for any CI system, and this satisfied our worries about reinventing the wheel. The way BuildKite works is you run BuildKite agents on your own machines, and those agents talk back to BuildKite.
BuildKite also ties into the events for your repo, so when you push code to GitHub, BuildKite knows that it needs to start a build. You tell BuildKite the exact scripts you want the agents to run. The agents pull the code down from GitHub, run the scripts, and then propagate the results back to BuildKite, and BuildKite propagates them to wherever you need the results to be sent.

The compute cluster we have is, at peak, 90 c4.8xlarge instances in EC2. That gives us about 5.4 terabytes of memory and over 3,200 cores. The cluster is hosted in AWS, it's auto-scaled, and we manage it with Chef and prebuilt AMIs. The instances are memory-bound, because the containers we run on the instances include all the services required for Shopify to boot. Finally, we have to do some I/O optimizations on these machines because of the write-heavy workload of downloading a ton of containers, so we use RAMFS on the machines.

I mentioned we auto-scale our compute cluster. We couldn't use Amazon's auto scaler, because Amazon's auto scaler works only on HTTP requests, so instead we had to write our own, called Scrooge. It's just a simple Rails app. The way it works is it polls BuildKite for the currently running BuildKite agents and checks how many are required — BuildKite calculates this based on the number of builds it currently needs to run. Scrooge then boots up new EC2 machines or scales the cluster down.

We also kept cost in mind as we built the system, so we do some AWS-specific optimizations. That includes things like keeping an instance booted up for the full hour, because Amazon bills by the hour. It also includes using spot instances and reserved instances. And we try to improve utilization: since machines can stay booted for an hour even though we don't require that capacity, we allocate a dynamic amount of agents for builds.
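The scaling decision Scrooge makes can be sketched as pure logic. Everything below — the method names, the agents-per-instance figure, the termination slack — is invented for illustration; the real app polls BuildKite's API for the agent counts.

```ruby
# Sketch of a Scrooge-style scaling decision. All names and numbers
# here are hypothetical; the real app gets its figures from BuildKite.
AGENTS_PER_INSTANCE = 10 # agents one EC2 machine can host (assumed)

# Given how many agents BuildKite says it needs and how many
# instances are currently running, decide how many to add or remove.
def instances_delta(required_agents:, running_instances:)
  needed = (required_agents.to_f / AGENTS_PER_INSTANCE).ceil
  needed - running_instances
end

# Amazon bills by the hour, so only terminate an instance near the
# end of its billed hour; otherwise the remaining time is free capacity.
def terminate?(uptime_seconds:, slack_seconds: 300)
  (uptime_seconds % 3600) >= (3600 - slack_seconds)
end

puts instances_delta(required_agents: 95, running_instances: 5) # prints 5
puts terminate?(uptime_seconds: 1800)                           # prints false
```

The `terminate?` check is what makes "keep the instance booted for the full hour" cheap to implement: the scale-down loop simply skips any instance that still has billed time left.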
At peak, we can give up to 100 agents for branch builds, or up to 200 agents for master builds. Keep in mind this isn't one-size-fits-all: for us, AWS and auto-scaling work; for other companies, bare metal might be the correct solution.

The funny thing about Scrooge is that the number of BuildKite agents is an implicit measure of how productive developers are at the company, right? When you're pushing more code, you're more productive, so we can track the amount of productivity going on. I took this graph from an average day, and you'll notice there are three points. Can anybody guess what the two valleys and the one peak are? Any guesses? Lunch. The bottom axis is UTC time, and lunch at Shopify is from 11:30 to 1:30. That first valley is the first lunch rush: people get up and go to lunch. So what's the peak? Well, what do you do before you leave your computer while you're working on something? You commit work-in-progress and push it up to GitHub, right? That's what the peak is. And then that big dip is everybody going for lunch.

So, I mentioned containers. The large speedup we got with the compute was from using Docker and using containers to run our tests. We were able to get a large speedup because we do all the configuration you would need during the container build, so you only have to do it once, and the moment a container is on a machine, it can instantly start running tests. We do things like getting our dependencies onto the machine and compiling all of our assets. We also get test isolation from Docker — this isn't as big of a deal with Rails, but it's still quite useful. Finally, Docker provides a distribution API. Most things speak Docker, so we can put the container anywhere we want as long as we announce where the registry is.

Shopify has outgrown Dockerfiles; we have our own internal build system called Locutus.
It uses the Docker build API to build containers using bash scripts. At the time of this first iteration of our CI system, it ran on a single EC2 machine, and that machine wasn't dedicated to Locutus. It was one of those machines where you have a bunch of apps that need to run in production but aren't critical for production, so you put an app on it, and another, and another, and eventually you have a bunch of apps on this one machine and it's become production-critical. It was one of those.

Building containers for our CI system forced us to repay a lot of the technical debt the app had accrued — the Shopify code base is 10 years old, and you accrue a lot of technical debt. While we were trying to build containers, we ran into really weird issues, like compiling assets requiring a MySQL connection.

For test distribution, we went with the simple solution: every container ran a set of tests based on an offset from the container index. We had two categories of containers: some ran Ruby tests and some ran browser tests. The issue we ran into is that the Ruby test pool was much larger, and browser tests are much slower whereas Ruby tests are faster. So the Ruby tests would finish running, and then the browser tests would take a couple more minutes, resulting in longer build times.

For artifacts, at the end of a CI run, the agents on the boxes would go into Docker, grab the artifacts, and upload them to S3. We also had an out-of-band service that would get webhooks from BuildKite, dump some of those artifacts into Kafka, and emit some metrics to StatsD. All roads at Shopify on Kafka lead to Dataland, so we were able to use some of those artifacts later to find flaky tests or flaky areas of the code base. And this was our first iteration — this is what the final architecture of the first iteration looked like.
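That first-pass distribution scheme — each container statically picking tests at an offset of its own index — can be sketched in a few lines. The method name and test names are illustrative, not Shopify's actual code:

```ruby
# First-pass distribution: each container is told its index and the
# total container count, and statically picks every Nth test.
# Names here are invented for illustration.
def tests_for_container(all_tests, index:, total:)
  all_tests.each_with_index
           .select { |_test, i| i % total == index }
           .map(&:first)
end

tests = %w[a b c d e f g]
tests_for_container(tests, index: 0, total: 3) # => ["a", "d", "g"]
tests_for_container(tests, index: 2, total: 3) # => ["c", "f"]
```

The weakness is exactly the one described: the partitions are fixed up front, so a slow partition (the browser tests) keeps the whole build waiting after the fast partitions finish.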
But then Docker decided to strike back. We had shipped a second provider, but in doing so we brought a bunch of confusion to the company, because we decided to run both providers in parallel. We also learned that a single box doesn't scale — Locutus started running into capacity issues — and having two different types of containers run tests was making our builds longer than they should have been.

First, the confusion. We decided to ship both CI systems in parallel so we could gain more confidence before we rolled out the new one and removed the old CI system. The problem is we did a bad job of communicating to the whole company what we were planning to do and how we were doing it. Developers saw two statuses — one green, one red — and they weren't sure which to trust, which to believe in. That unfortunately eroded developer confidence. The solution was to switch fully to the new system at 100% and take the dive.

Second, clustering Locutus. When we outgrew our single Locutus instance, we had to go back to the drawing board and rebuild it to make it scalable, while keeping it as stateless as possible. The old version of Locutus was a single instance: it would get the webhooks, build the container, and push it up to the Docker registry. In the new version, a coordinator instance gets the webhooks and allocates the work to a pool of workers. Each repo is hashed to a particular worker, so the same worker always receives the same repo's work. I say stateless-ish because there's a cache on each of these machines: you can lose the cache, and the workers will still be able to build the container fine.
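The repo-to-worker pinning can be done with any stable hash. A minimal sketch — the use of MD5 and these names are my assumptions, not the actual Locutus code:

```ruby
require "digest"

# Map a repo to one of N workers so the same repo always lands on the
# same worker, and therefore always hits that worker's build cache.
def worker_for(repo, workers)
  index = Digest::MD5.hexdigest(repo).to_i(16) % workers.size
  workers[index]
end

workers = %w[worker-1 worker-2 worker-3]
worker_for("shopify/shopify", workers) # same worker on every call
```

One caveat of plain modulo hashing: resizing the worker pool reshuffles most repos onto new workers, invalidating their caches. A consistent-hashing scheme would limit that churn, at the cost of more complexity.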
The problem, though, is that once the cache is lost, it can take upwards of 20 minutes to build a new container, which just doesn't work that well.

For our second stab at test distribution, the first container to boot would load all the tests into Redis. The rest of the containers would look at that Redis queue and pull off test jobs one by one. We also got rid of test specialization, so containers ran all types of tests. This equalized the running time of containers, so they would finish within tens of seconds of each other. And this is what the second iteration of our CI system looked like.

Then, Docker: the gift that keeps on giving. Nobody tests starting tens of thousands of containers a day — Docker doesn't, unfortunately — but that's exactly what we were doing in our CI system. We ran into a bunch of instability with Docker, and we didn't account for these failures, which unfortunately eroded some developer confidence in our new system. Every new version of Docker had major bugs: they would fix old ones, but introduce new ones or bring back old ones. Some examples: we'd see network timeouts randomly happening; we'd see kernel bugs where Docker would refuse to boot if AppArmor was on the machine; we saw issues where concurrent pulls would cause deadlocks. That was a lot of fun. And since we didn't account for this, builds would fail: you would have a green test suite, but your build would fail, and that's very annoying for a developer. The solution was to actually swallow the infrastructure failures — identify when they were occurring, and swallow them. Going into this project, you hear stories from Google where a drive fails every couple of minutes, and you think, well, that's Google, that's not for us. But even at our scale, we still saw over a hundred containers fail a day, which made us realize we can't ignore this problem.
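Swallowing infrastructure failures boils down to classifying errors — was it the tests, or was it the machinery? — and retrying the machinery instead of failing the build. A minimal sketch; the error classes listed are stand-ins, where the real classification matched the Docker failure modes described above:

```ruby
# Hypothetical split between infrastructure errors and genuine test
# failures. In practice the list matched observed Docker failure
# modes (network timeouts, daemon refusing to start, deadlocked pulls).
INFRA_ERRORS = [Errno::ETIMEDOUT, IOError].freeze

def run_with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *INFRA_ERRORS
    # The infrastructure flaked, not the tests: retry rather than
    # failing a build whose test suite is green.
    retry if attempts < max_attempts
    raise
  end
end

calls = 0
result = run_with_retries do
  calls += 1
  raise Errno::ETIMEDOUT if calls < 3 # two infra flakes, then success
  :green
end
result # => :green after three attempts
```

The important part is the classification: a real test failure must never be retried away, so anything not on the infrastructure list is re-raised immediately.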
So the way to approach your infrastructure is to get into the mindset of pets versus cattle. You don't want to treat your servers as pets. The way you can identify whether you are: you give each server its own unique name; if something breaks, you SSH in, find out what the problem is manually, create an artisanal fix, and move on. In contrast, when you're treating your servers as cattle, each server has a number — node 1, node 2, node 3. You automate detection of issues, you remove a broken node automatically from the cluster, and the node knows how to clean itself up and put itself back in. We had to go and do this, and until we had, we had a lot of toil on the team: we would manually find a broken node, go and fix it, and put it back in, and we just wasted a lot of time.

Side note: while I was making the slides for this talk, I found a bunch of pictures of cats with lasers in space. I just wanted to say that I love the internet, and I think we all deserve a round of applause for making this possible.

Our third iteration on test distribution was actually about stability. The problem we saw with test failures is that you can get into a race condition where a container pulls a test off the queue and then fails. Since all the remaining tests ran and were green, and nobody knows this one test was never run, the build is green. That's a very scary situation. So what we do now is, when we dequeue a test and it runs successfully, we insert it into a second set. At the end of the build, we know all the tests that should have run and all the tests that actually ran. We compare the two, and if they don't match, we fail the build.
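The bookkeeping behind that check can be modeled with a pending queue and a completed set. In production both live in Redis; this in-memory sketch (with invented names) shows just the logic:

```ruby
require "set"

# In production the pending queue and completed set live in Redis;
# this in-memory model shows the bookkeeping. Names are invented.
class TestLedger
  def initialize(tests)
    @expected  = tests.to_set # everything that should run
    @pending   = tests.dup    # the work queue containers pop from
    @completed = Set.new      # tests that actually finished
  end

  # A container pops the next test off the queue.
  def pop
    @pending.shift
  end

  # After a test runs successfully, record it in the second set.
  def complete(test)
    @completed.add(test)
  end

  # End of build: every test that should have run, ran.
  def all_ran?
    @expected == @completed
  end
end

ledger = TestLedger.new(%w[test_a test_b test_c])
while (test = ledger.pop)
  ledger.complete(test) unless test == "test_b" # simulate a lost test
end
ledger.all_ran? # => false, so the build is failed
```

If the container that popped `test_b` had died mid-run, the plain queue would look empty and green; the set comparison is what catches the silently dropped test.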
This is a rare situation — we don't see it often — but it's good to have that safeguard there. So this is what the final iteration looks like with BuildKite today, and this is what it looks like internally at Shopify.

In conclusion: don't build your own CI if your build times are less than 10 minutes. It's not a productive use of your time. It took a long time for us to get through this project; we had multiple people working on it for months. Also, if you have a small application, typically the issue isn't compute — it's a configuration issue — and you're likely to be able to find large optimizations without building your own CI. If your build times are over 15 minutes, you should start considering implementing it on your own. If you have a monolithic application with snowflakes all over the code base, and you've optimized as much as possible, getting your own compute and being able to have more impact on it could be very effective. Also, if you've reached the parallelization limits of your CI provider, having your own compute allows you to break past that.

If you do decide to go and build your own CI system, please don't make the same mistakes we did. Be sure to commit 100 percent once you've built your new system. Beware of rabbit holes — I know we all like to say it'll be done in two weeks, but it's very difficult for that to be the case. Finally, make sure to think of your infrastructure as cattle, not pets. You'll save yourself a lot of headache and time. Thanks.

The question was, did we spend any time optimizing the code base or the tests, instead of just focusing on the CI system? We actually didn't. We found that parallelization was enough at the time. When you have something like 40,000-plus tests, you're going to have some slow ones, and it just evens out in the long run. The issue we did find with the test code base is flakiness. You'd be surprised by the number of tests that assume state because of the order the test suite runs in.
When you distribute tests from a queue and they run on different containers, the state is different. We had to spend a lot of time, and develop some tooling, around figuring out why a test is flaky and fixing it. That's where we spent the time.

The question was, how bound is the system to Docker? I would say most of the speedup we got was actually from Docker — not Docker itself, but using containers. The reason is that a lot of the time you spend in most CI systems today is configuring the application to be able to run tests: compiling assets, downloading new gems, and so on. When we used Docker, we were able to do all of that once, and then all the instances could instantly start running tests. So a lot of the speedup did come from Docker. We also gained a lot from the parallelization.

The question was, what was the time frame of the project? We started initially working on this in the winter last year. By the summer we were around phase two: most of the company was already using BuildKite and this new system, and we had seen the performance gains. But during that time we spent a lot of time learning — half the team was trying to fix these machines, and we were still seeing quite a bit of flakiness because of the test distribution we were doing. That lasted until about September, at which point the project mostly wound down and the team moved on to other things.

The question was, did we maintain our costs? The answer is yes — we maintained the same budget. You can fill up that budget: we had more compute capacity and faster build times for the same amount of money.

The question was, how big was the team? Around six to eight people. It shifted, but I would say in that range. Thank you.