Yeah, hi. So first of all, I want to apologize for the swearing in the title. It was kind of an internal name that we used, and someone on my team said, wouldn't it be funny if you got the talk accepted with this in the title? So I left it in. I'm going to talk about something that we call shitlist-driven development, and I'm going to share some tricks about how to work with really, really large code bases. It's not going to be too Ruby-specific, so no matter which language you work with, hopefully you can find something useful.

For a little bit of context: I work for Shopify, which is an e-commerce software-as-a-service platform headquartered in Canada. We are, as far as I know, one of the oldest and I think the largest Ruby on Rails code bases in the world. We've been using Rails since version 0.something, over ten years ago, and I think it's probably the biggest Ruby company in Canada. This is Ottawa, the capital of Canada, and one of the buildings on the right is the Shopify headquarters.

My job at Shopify is on one of the core architecture teams, and a lot of my work looks like this: very broad work, very low-level changes, lots of maintenance, and lots of work that affects the entire application, the entire platform. Most of the tips and tricks in this talk come from that context. Shopify is a monolithic Rails application — that doesn't mean the tips I'm giving here can't be applied to other code bases, but just so you know where this comes from.

We run a multi-tenant architecture, which means we host people's online stores — more than 400,000 of them — and they're all running in the same application, the same database, the same deployment. It's not like each shop has its own deployment; it's all the same application. We do about 20 to 40,000 requests per second, and our main GitHub repository has about 800 contributors, which includes developers, designers, content strategists, documentation writers, all that kind of stuff. There's a whole bunch of problems that come with having so many people trying to change the same thing at the same time, and so quickly. All of those 800 contributors have permission to merge changes to master, and they all have permission to deploy to production. With the rate of change we have right now, we deploy about 50 times per day, and those 50 deploys contain somewhere around 50 to 100 PRs a day. That amount of change every day gives you a bunch of interesting problems that you don't really run into with smaller applications, but yeah, it's a challenge. So I want to frame this talk from the perspective of productivity problems.
So I'm going to talk about three important productivity problems that we were faced with, and share some tips about how we worked on them.

The first one is deploys. If you have this many people working on the same code and the same application and they all want to deploy, the deploys actually become a bottleneck. What I mean by that is: if you hire more people, they want to ship more code, and shipping more code means you either need to deploy more often or you need to have bigger deploys — one of the two. And for several reasons, smaller deploys are often better. A few obvious ones: fewer changes are easier to debug; it's safer because you're changing less code at the same time; it's easier to roll back and easier to revert; and it's easier to keep an overview of what's happening. So from that perspective we wanted small deploys, not bigger ones.

Now, an important observation is that if you want small deploys and you want to deploy often, you need your deploys to be fast. As an example, I said we deploy about 50 times per day. Most of our developers are in the same time zone, so that means about six deploys per business hour, and if those deploys take longer than ten minutes, they become a serious productivity problem for us, because we can't ship code as fast as we want to, which means we can't develop features as quickly as we want to. So what do we do about this?

First of all, when I say deploy, I don't only mean getting the code into production; I mean the entire pipeline that comes with it. If you use Docker, for example: building a CI container, running your tests on CI, building the production container, uploading the production container to wherever you have to upload it, getting it onto the servers, restarting all of those containers, and making sure everything is successful. When I say deploy, I really mean this entire sequence of steps.

An obvious one: if you have CI builds, you should parallelize them. If two people want to ship something at the same time, you shouldn't run the CI builds one after another, and even within a single person's build you can easily run the tests in parallel. That one is pretty obvious.

Another one that was super helpful for us is that you should build those containers in advance. I said before that we have about 50 deploys but about 100 PRs, which means some deploys contain more than one PR. So if we build a production container for every merge to master, we build a lot of containers that never actually get deployed — but the huge advantage is that when someone wants to deploy a given container, it is already there and we don't have to build it in that moment.

Another really big improvement for us was during the container builds. We would often invoke different rake tasks, and each of those rake tasks would boot the Rails application. If your Rails application is this big, just loading the Rails environment before you can even start running your rake task often takes up to ten seconds or so. Finding a way to combine all of those so you only have to boot once was a huge speedup for us — a rough sketch of the idea is below.
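To make that concrete, here's a minimal sketch of combining several deploy-time rake tasks into one process so the Rails environment is only booted once. The task names are just illustrative examples, not the tasks Shopify actually runs during its builds.

```ruby
# lib/tasks/deploy_steps.rake
namespace :deploy do
  # One wrapper task that depends on :environment once, then invokes the
  # other tasks in the same (already booted) process via Rake::Task#invoke,
  # instead of shelling out and paying the ~10 second boot cost each time.
  task steps: :environment do
    %w[db:migrate assets:precompile].each do |name|
      Rake::Task[name].invoke
    end
  end
end
```

Running `bin/rake deploy:steps` then boots Rails once instead of once per task.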
And deploy to many servers in parallel — that's obvious; you don't want to do one server at a time if you have an application of this size.

Now, if you look at all those different steps — building CI containers, running tests, building the production container, restarting the application — all of those require booting the application. So if you find a way to reduce the time it takes to boot your application, that has a huge impact in many different areas. That was a really big improvement for us.

The last one, which is often a little bit overlooked, is how long it takes to shut down your application. Especially if you're running a web application and you're using Unicorn, there's a standard timeout value that says how long a request is allowed to run, and if you want to deploy, you either have to terminate those requests — which is going to lead to errors — or you have to wait for them to finish, and if you wait, your deploy is going to take at least as long as that takes. So doing whatever you can to have as few long-running requests as possible is going to have a huge impact on the speed at which you can deploy.

The other bottleneck related to deploys is the human. There are a couple of steps you can totally get away with at a smaller company, a smaller code base, a smaller project, but they stop working if you want to deploy a hundred times a day. One example: smaller companies often have an ops team, and only that ops team is allowed to deploy. But if you have 800 people and they all want to deploy, having all of them ask the ops team doesn't work — you need to allow people to deploy on their own. Having someone who decides "now is a good time to deploy" doesn't scale. Asking everyone to pay attention to the status of CI on master doesn't scale. Asking everyone to pay attention to errors during a deploy — having everyone watch the deploy to see whether it was okay — doesn't scale with this many people. And in the end, even saying "every developer deploys their own changes" doesn't scale at some point. In summary: humans don't scale, and you should automate this process as much as you can.

To illustrate, I want to show you the tool that we use, but there's nothing really special about this particular tool — you can easily write your own. The point I'm trying to make is that you should have some kind of tool, and that tool should not be a human; it should be software. This is our deploy software, it's called Shipit, and it's open source, so if you want you can use it, or you can steal some ideas and write your own. A few important parts: here, for example, you can see it's waiting for CI, and as soon as those tests are passing, we automatically deploy — no human has to press a button or say it's okay to deploy now. Basically, we expect that if people merge to master, that means the change is good to be deployed. A minimal sketch of that "deploy when CI is green" idea follows.
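Here's a rough sketch of that loop, assuming the octokit gem for reading commit statuses; the repository name, `already_deployed?`, and `trigger_deploy!` are hypothetical stand-ins for your own deploy tooling, not part of Shipit.

```ruby
require "octokit"

github = Octokit::Client.new(access_token: ENV.fetch("GITHUB_TOKEN"))
repo   = "example/monolith" # hypothetical repository name

loop do
  sha   = github.commits(repo).first.sha          # newest commit on the default branch
  state = github.combined_status(repo, sha).state # "success", "pending" or "failure"

  # Deploy automatically as soon as CI is green; no human presses a button.
  trigger_deploy!(sha) if state == "success" && !already_deployed?(sha)

  sleep 30
end
```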
Another big one is that people make mistakes. You merge something, then you figure out something was wrong and you need to revert it, and we often had problems where people had to manually keep an eye on it — "oh shit, this can't get deployed, so revert it, then lock the deploys, make sure the first one doesn't go out without the second one" — and so on. Automating this in software really took some of the human interaction out of the process for us. This feature, for example, says that if there's a revert for a commit that hasn't been deployed yet, then nothing in that range can get deployed, and once something passes CI after that, it gets deployed automatically.

Another thing you can automate is telling people that their code is now being deployed. If you deploy stuff automatically, it's still important that people know their changes are going out, so we have a Slack channel where people get notifications that their code is going out.

Another important thing is that we don't want people to merge too many commits into master — we don't want the commits to pile up. If there's a large backlog of stuff that hasn't been deployed yet, we want people to wait. In a smaller application it's okay if someone keeps an eye on that and pokes people and says, hey, don't do this, but if you have a lot of people and the application gets really big, then this kind of educating people is also something you can automate. In our case, if someone merges to master while CI is failing, or while there are a lot of commits that haven't been deployed yet, those people automatically get a notification saying they shouldn't do that.

This one I found really interesting because it's a bit of a workaround for a missing feature in GitHub, I think. If merging to master basically means the code is going to get deployed, then merging itself also becomes a bit of a bottleneck. So the workflow we actually use is: we have a browser extension that injects this button into the GitHub UI, and people don't actually merge their PRs themselves. They just say "this is ready to be merged", and later a bot comes and merges it for them, when some heuristic decides that now is a good time to deploy it. This means people can say "okay, this is ready" and then move on and work on the next PR; the developers, the humans themselves, don't have to orchestrate the whole deploy process. It's another step that we automated — a rough sketch of the kind of heuristic such a bot might use is below.
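This is not Shipit's actual logic, just a minimal sketch of the idea; `master_ci_green?`, `undeployed_commit_count`, and `merge!` are hypothetical helpers, and the threshold is an arbitrary example.

```ruby
MAX_UNDEPLOYED_COMMITS = 10 # arbitrary example threshold

def safe_to_merge?
  # Only merge when master CI is passing and the undeployed backlog is small,
  # so commits don't pile up faster than they can be deployed.
  master_ci_green? && undeployed_commit_count < MAX_UNDEPLOYED_COMMITS
end

def process_merge_queue(queued_pull_requests)
  queued_pull_requests.each do |pr|
    break unless safe_to_merge? # otherwise wait and try again on the next run
    merge!(pr)
  end
end
```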
Okay, so the next problem I want to talk about is where the name in the presentation title comes from, and the problem is basically: how do you deal with deprecations? This comes up especially if you work on a very low level, like the team I'm working on, where you do a lot of framework changes, a lot of stuff that affects not just a certain feature but the entire application — or, for example, if you're the team responsible for upgrading to a new Rails version, that kind of thing. Basically: you have an internal API that is being used by a lot of code, and your job is to migrate from the old way of doing it to the new one.

The way Rails solves this internally is with ActiveSupport deprecation notices, and that's basically logging: everybody gets spammed with the log output, and you just have to hope that people will fix it. But the reality is that people will not fix it, because nobody feels responsible for those warnings when you have 800 people working on the application.

So the basic idea is you go and fix all of those call sites to use the new method, and now everything is fixed. But the problem is that in the meantime, while you did that, someone else might have added a new class that also does it wrong — or if you fix B first and then C, by the time you finish C someone might have unfixed B and done it wrong again. It gets really annoying: if you have a lot of people and you try to make very low-level changes, you step on each other's toes all the time.

So what else can you do that is better than logging? You can try to send an email saying "hey, don't do this", or you can send a Slack announcement — basically tell everyone to use the new method, not the old one. That might work if you have five people on your team, but if you have 800, and new people get hired all the time, and the old people forget, or maybe they don't care — for all kinds of reasons this doesn't really work.

So the idea we had is: we need to find a way to automate this. We need to find a way to educate people about what the right behavior is, and do this education through code, by enforcing certain rules, but without pissing everyone off and without everyone having to come to us and ask for help.

The other extreme you can go to is to just make the old method raise and say "you can't use this anymore", then run your tests, fix all the failures, and then you know everything is good. But if you run hundreds of thousands of tests and you have a lot of code, you can't really do that: you're basically forced to make all those changes in one PR, and if you ship a hundred PRs every day, that is definitely going to cause merge conflicts and all that kind of stuff. So you want to find a way to fix those things one after the other — ideally in tiny slices, one PR per change or something like that — but without people being able to undo your work.

So the idea is basically: if we have two classes that both do it wrong, B and C, can we fix B in such a way that nobody can unfix it? Basically, can we whitelist some of those existing users without allowing people to add new ones? That is an idea that we jokingly called the shitlist internally. A shitlist is basically a list of things that are already shitty — stuff that is doing it wrong — and it's basically a whitelist: all of those callers are allowed to keep doing it wrong, but you can't add new stuff to it.

For a second, just assume that the deprecated method knows who called it. That's big — it's one of the important ideas, and a minimal sketch of how a method could know that follows.
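In Ruby a method can inspect its caller with `caller_locations`. This is only a sketch of the idea, assuming you can edit the deprecated method yourself; the class names, file paths, and the error message are illustrative.

```ruby
class LegacyApi
  # Files that are already doing it wrong and are grandfathered in.
  SHITLIST = [
    "app/models/b.rb",
    "app/models/c.rb",
  ].freeze

  def self.deprecated_method
    calling_file = caller_locations(1, 1).first.path

    unless SHITLIST.any? { |allowed| calling_file.end_with?(allowed) }
      raise "LegacyApi.deprecated_method is deprecated. Existing callers are " \
            "grandfathered in LegacyApi::SHITLIST, but new callers must use " \
            "LegacyApi.new_method instead."
    end

    # ... old behavior keeps working for the grandfathered callers ...
  end
end
```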
If the method knows who called it, you can do something like this: you say, if the caller is either B or C, then those two are okay, because they have been around forever and we're working on them — but nobody can add new ones. And now you can go and fix C and remove it from the shitlist: B is still allowed, C is fixed, nobody can accidentally unfix it, and nobody can add new stuff either.

The problem with this approach is that I kind of assumed I was able to change the method I want to deprecate. But that method might be in a gem, or it might be in Rails, or somewhere else outside of your control, or maybe you don't want to go through all of those classes and change the parameters everywhere. Maybe the level of granularity you want is different, too: maybe instead of saying B and C are allowed to call this, you want to say the Shop model and the Customer model are allowed to call this but not the Checkout model, or internal web requests are allowed but not external ones, or background jobs are allowed to do it but not web requests. There are different granularities you might want, and the key to implementing that granularity is how you figure out who called it.

Something simple you can do is come up with an annotation. If you look at the bottom, we have a controller and a job, and they register themselves with the shitlist using what I called a context here: they say the context the code is now running in is "shitty controller, full method" or "shitty job". The shitlist itself can then say: this should raise an exception unless it's coming from the shitty job — something like the sketch shown after this section.

The workflow is then: at first your whitelist — your allowed-contexts constant — is empty. You run your tests and you put everything in there that fails. That's one PR you can ship, and now you're confident that nobody can accidentally add any new shit. Then your task is to remove one item from the list, see which tests fail, fix all of them, and move on to the next one. This is really great for generating to-do lists, and it basically gives your team a progress indicator, because you can see the list getting smaller every day. It's super awesome for motivation, because people feel like they're making progress, and it's much more measurable than a log full of deprecation spam that nobody is going to look at.

So to summarize: in our experience this is very valuable if you need to change very broad behavior, if you're maintaining some kind of internal API, or if you need to break down a huge task into small chunks. For my team this has become the go-to tool. It's awesome for generating to-do lists, for having something to just work through.
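Here is a minimal sketch of what that context-based flavour could look like — all class names, contexts, and messages are illustrative, not Shopify's actual code.

```ruby
module Shitlist
  # Contexts that are already shitty and therefore still allowed.
  # Start with this empty, run the tests, and add every failing context in one PR.
  ALLOWED_CONTEXTS = [
    "ShittyController#full_method",
    "ShittyJob",
  ].freeze

  # Callers register the context they are running in around their work.
  def self.with_context(context)
    previous = Thread.current[:shitlist_context]
    Thread.current[:shitlist_context] = context
    yield
  ensure
    Thread.current[:shitlist_context] = previous
  end

  def self.check!(feature)
    context = Thread.current[:shitlist_context]
    return if ALLOWED_CONTEXTS.include?(context)

    raise "#{feature} is deprecated and #{context || 'this code path'} is not on " \
          "the shitlist. Use the new API instead, or ask the core architecture team."
  end
end

class ShittyJob
  def perform
    Shitlist.with_context("ShittyJob") do
      Shitlist.check!("legacy_feature") # allowed: "ShittyJob" is grandfathered
      # ... existing behavior keeps working while it gets migrated ...
    end
  end
end
```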
This approach is also awesome for educating your team about how you want them to write code — which kinds of methods you want them to use and which kinds you want them to stay away from — and this education happens at the code level, so you don't have to talk to all those humans; you just write good error messages, and the error message should explain what you want from people. As an example, if there's certain behavior you want to deprecate, the error message should explain: what are you doing wrong, why is it wrong even though it was working yesterday, how can you fix it, and who can you talk to if you don't know how to fix it. An error message like this enforces good practices via code, for whatever it is you want people to do.

Okay, so the third problem I want to talk about is unreliable tests. This might not be such a big deal if you're working with more of a service-oriented architecture, but if you're working on a monolithic Rails application, it can get really annoying really quickly. The interesting thing is that most people probably know what the problem is with unreliable tests, but they're usually not annoying enough to force you to do something about them. The more tests and the more people you have, though, the more these problems stop being rare and become common. When I say unreliable test, I mean a test that sometimes passes and sometimes fails without you making any changes to the code.

For some context: we run about 750 CI builds per day, about ten minutes each, with about 70,000 tests. If only a single one of those 70,000 tests is unreliable and fails one percent of the time at random, we lose over one hour of productivity per day. And those numbers are based on the assumption that the test is failing on your branch, so that's one hour of productivity for one person; if you apply this to master, where a failing test affects way more people, it's even worse.

There are two common types of unreliable tests I want to talk about. One is the flaky test — the one that's easy to spot and easy to debug. It's a test you see all the time: sometimes it fails, sometimes it doesn't. Often that's time-dependent — maybe there are a couple of lines in your code and if more than a second passes between them, the calculation doesn't match anymore — or maybe the test only fails when your CI system is under load because something runs out of memory, that kind of thing. The second category is way sneakier, because the test that is the problem is not actually the one that is failing. Those tests are order-dependent: you might have a test B that fails only if test A ran before it. Tracking down those tests and fixing them is super important, because they can be a huge productivity killer for your team.

So how do you track them down? A lot of people probably know about software like Bugsnag, which is exception tracking software that a lot of people use in production: every time an exception happens in your production system, you log that exception somewhere and you get data analysis features and all that kind of stuff. I thought it was a really cool and interesting idea to not only use this in production, but also use it for your tests. Every time a test fails in our CI system, we report that as an exception, and then we can use all of those data analysis features on the test failures — a minimal sketch of that idea follows.
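Here's a rough sketch of the idea for Minitest, assuming the bugsnag gem; this is not Shopify's actual reporter, and how you register the reporter depends on your test setup (for example via a minitest plugin or your test helper).

```ruby
require "minitest"
require "bugsnag"

# Reports every failing test to an exception tracker so you can reuse its
# grouping, history, and alerting features for unreliable-test analysis.
class CiFailureReporter < Minitest::AbstractReporter
  def record(result)
    return unless ENV["CI"]                       # only report from CI builds
    return if result.passed? || result.skipped?

    failure = result.failures.first               # a Minitest::Assertion (an Exception)
    Bugsnag.notify(failure)
  end
end
```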
You get all kinds of cool stuff: you can see when a test started failing, which PR might have caused it, and you even get alerting — you can say if a certain test fails more than five days in a row, notify someone or ping the author automatically, that kind of thing. As with most problems, the very first step to fixing it is visibility: you need to figure out what is actually wrong, how bad it is, how often it happens, how many people are affected, and so on.

So after you've identified which tests are problematic, how do we fix them? With the first kind, the flaky test — the one that sometimes fails and sometimes passes — if you have a suspicion that a test might be flaky, you obviously want to confirm whether that's actually the case. What we do is we have a little script that runs on our CI system. Each of those little green and red boxes is one container that we run in parallel, and each container runs that one single test in isolation a thousand times — so here we basically ran the same test 64,000 times. If it looks like this, it means sometimes it fails and sometimes it passes, and you have confirmation that this test is actually a problem.

The other one, the one that's a little sneakier and harder to debug, is the leaky test. If you're not so familiar with how testing frameworks like Minitest or RSpec work — I found this a bit confusing when I was first learning about test-driven development — they don't actually create a new process for each test. They run multiple tests in the same process, and that means if a test mutates global state by mistake, that mutation is still visible in the next test, so tests can affect each other. A leaky test is a test that makes another test fail by modifying global state.

A really great way to find those tests is binary search. What you do is look at the monitoring I showed before and take the list of tests that ran. The last one is the one that failed, but as I said, the one that failed is not actually the problem — the problem is one of the tests that ran before it, because one of those caused the last one to fail. So you take that list and divide it in half: you run the first half plus the failing test, and if it fails again, you know the problematic test has to be in that half; if it doesn't fail, it has to be in the second half. You repeat this, cutting the list in half each time, basically performing a binary search through the list of candidates. At the very end, if the tool does identify a leaking test, it tells you how to reproduce it locally — here's your leaky test and here's the test that fails because of it — and then you can track it down this way. The rough idea looks like the sketch below.
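A minimal sketch of that bisection, not our actual tool; the block stands in for whatever runs a given ordered list of test identifiers (for example by shelling out to the test runner) and returns true when they all pass.

```ruby
# candidates:   the tests that ran before the failure, in their original order.
# failing_test: the test that fails, but only when one of the candidates leaks.
def find_leaky_test(candidates, failing_test)
  while candidates.size > 1
    first_half = candidates.first(candidates.size / 2)

    if yield(first_half + [failing_test])
      # The first half passed together with the failing test,
      # so the leaky test must be in the second half.
      candidates = candidates.drop(first_half.size)
    else
      candidates = first_half
    end
  end

  candidates.first # the test that makes failing_test fail
end

# Example (hypothetical run_tests helper):
# leaky = find_leaky_test(tests_before_failure, "CheckoutTest#test_total") { |tests| run_tests(tests) }
```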
Putting all those pieces together: we have this automatic monitoring, and if a test fails too often or for too many days in a row, we can automatically confirm whether it's leaky or flaky and ping the author of the test to say, hey, this is a problem, you should fix it. That is super valuable for productivity.

Okay, so a quick summary at the end. I talked about three problems. The first one is deploys: if your application gets really big, and you have a lot of people and you want to ship a lot of code, one of the really important things you can do is make sure your deploys are really fast — because if they're fast, you can deploy more often and deploy smaller changes, which is safer. And besides making them fast, automate them: make sure there's as little human involvement as possible.

Problem two was what I call too many cooks in the kitchen: too many people trying to change the same thing, stepping on each other's toes, or accidentally undoing each other's work. What I'd like to encourage you to try is this approach of shitlist-driven development, which is basically a fancier version of deprecation warnings where, once you fix one warning, it's impossible to add new ones or unfix that one.

And the last one I talked about was unreliable tests. The important thing I want you to take away is that you can actually use a lot of your production monitoring tools for your tests and get a lot of insight out of that, and that using the binary search approach to find leaky tests is really, really powerful.

Okay, so with the time we have left we can take exactly two questions. Okay, so — yeah, here is good.

"I'm really glad I was sitting close to you. So my question is: I've seen a lot of really powerful internal tools here, and I think some of the underlying philosophies are applicable across code bases, but the specific tools might not be. So I'd love for you to talk about Shopify's decision-making process for what is important to build here."

You mean, which internal tools are important to build?

"Yeah — which internal tools are important to build, and how you can make internal tools that fit with the grain of your process effectively."

That's a good question. I would say the ones that affect the most people, the ones that have the most impact across the company, are probably the most important ones. For us the test stuff that I talked about was one of those: the more often you want to deploy, the more annoying it gets if there are unreliable tests. So that was one where we thought, oh shit, this is affecting 500 people, and if there are too many flaky tests, then there are 500 people who can't get any work done. So that was a good candidate for something we needed to work on. I don't know, does that answer the question? "Yeah, thank you."

Okay, great. Let's try and get someone from the other side of the audience.

"Hi, so first of all, it's a very interesting talk, and I have a question on the subject of unstable tests, because I've also had my share of tests that suddenly break one day. Can you share your experience on the most annoying and difficult unstable tests that you fixed, and how did you fix them?"
In my experience the flaky tests are usually pretty easy to identify and usually pretty easy to fix as well. The really annoying ones are the leaky ones, where the test that is failing is not actually the test that is the problem. Stuff I've seen a lot: often the problem is caused by state that is modified in one test, and then some other test is somehow affected by that. Something I see a lot is stuff related to caching — someone was trying to be smart and cached something in a global variable — or things related to Rails autoloading, where the first test caused a certain class or constant to be autoloaded and the second test behaved differently because of that. That's often really annoying.

There are also a lot of very annoying details about how transactional fixtures, for example, work in Rails. One thing a lot of people seem to run into at Shopify is that if your test makes a table modification — an ALTER TABLE statement to add a column, for example — that statement actually causes a database commit, which means the test does not correctly roll back its changes. Those kinds of intricate details that most people don't know about, and shouldn't need to know about, but that still affect your tests, are in my experience really annoying.

"Thank you."

All right, cool. Thank you very much, Florian.