Hey everybody, good afternoon. I'm going to be talking to you about continuous deployment at scale. A lot of what I'm going to talk about is more applicable to medium-to-large-scale, longer-term projects, but there are bits and pieces you can of course take away for projects of any scale.

How many of you have heard of Etsy? Most of you. To give you a sense of the scale we're working with: we have a little under two million sellers, 26.1 million active buyers, a little under $2.4 billion in gross merchandise sales (GMS), and a little under a thousand employees, mostly in Brooklyn, New York. My name is Premshree and I'm a senior engineer at Etsy; before this I worked as a senior engineer at Yahoo. This is my second time speaking here, so I'm really glad to be here.

I'm going to talk a little bit about the principles that guide the way we think about continuous deployment, then the specific tooling and culture we have in place to accomplish that, and we'll end with some Q&A.

One of the first principles is: just ship. It sounds obvious, but it often doesn't feature in process goals, which makes it harder to actually do. We want to enable innovation; our goal is to enable building products quickly and efficiently, and we want to be able to do this in a manner that's iterable. In the end, we're here to build products. We're not shipping code; we're shipping products.

As people who craft and build, we're driven by intrinsic motivators: purpose, autonomy, and mastery. When we think about the way we build our continuous deployment tools, we want to think about how they can optimize for these motivators. We want our tools to be enablers, not hindrances, to anything we build.

Experimentation is basically A/B testing. We want to be able to test the various ways in which we can optimize our products. For example, if I want to compare two variations of the home page and see which one performs better, I should be able to do that. I should also be able to use whatever metric I choose to mean success for a specific part of the product, whether that's bounce rate going down or conversion going up, and we should be able to do that easily.

We want to iterate quickly. You're never going to get to a point where you can say a product is done, or complete, or a hundred percent, and that's completely okay. But we can make measurable improvements to all of our products, and we should be able to do that quickly and iterate on them. We want to fail fast instead of having stagnant code lying around. Baked into this is also continuous improvement: as our tools help us deploy continuously, we also want the process to facilitate improving the process itself.

We also want to optimize for a low mean time to recovery. That's an acknowledgement of the fact that failure happens: things will fail, systems will fail. Instead of trying to prevent failure entirely, what you want to optimize for is how fast you can recover from a failure.

Before we jump into it, let's look at what a typical continuous delivery cycle looks like. A developer or designer commits some code; that triggers a build, which triggers some tests; you do some user testing, and then you're ready to release. At each of these steps there's a feedback cycle.
So if any part of the cycle is broken, you can't move forward, right? I'll talk a little bit about the tooling we have in place for each of these steps.

One of the things we do at Etsy is frequent check-ins, and we check in directly to master. When I say this, people are often surprised. We use GitHub, and GitHub is really good at branching, but branching often makes it incredibly hard to debug when there's a problem in production. Checking into master frequently works really well for us, and we do it by branching in code instead, using something known as feature flags. Etsy's open source library for this, Feature, is on GitHub, and it's basically very simple.

A feature flag is a bunch of configuration that tells you whether a feature is on or off. In this case we have a feature, my_feature, that's turned on. A feature can be off; a feature can also be ramped up slowly for a small percentage of users. In this example, the config is telling me my_feature is enabled for 1% of all users. A feature can also be enabled for a small percentage of users and bucketed either by user or by cookie. Bucketing by user means that any time a feature is enabled for a specific user, it's always enabled for that user. That's the difference between wanting a feature enabled consistently for logged-in users versus wanting something to show up or not show up irrespective of login status.

In the application code it's a very simple check: you just check whether the feature is enabled. When you want to check a feature with user bucketing, you just pass in the user object.

Experimentation, or A/B testing, is also done with feature flags. Instead of having code that just checks whether a feature flag is on or off, you can have a multi-variant feature. In this example, layout 1 is enabled for 1% of all users, layout 2 for 3%, and layout 3 for 3%. In your code, instead of checking whether the feature is on or off, you do a switch on which variant of the feature is enabled for that user.

One concern that often comes up with feature flags is that they go stale pretty quickly: if you released a feature and turned it on, the if and switch statements often end up remaining there. It's a tricky problem, but it's not affecting us too badly right now, and one of the things we do now is send automated emails when a feature flag goes stale, after it's been on for all users for a certain period of time.
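To make those flag examples concrete, here is a rough sketch in the style of Etsy's open source Feature library (etsy/feature on GitHub). It's written from memory of the library's README, so the exact config keys may differ slightly, and the homepage_layout experiment and its variant names are made up for illustration.

```php
<?php
// Sketch of feature-flag configuration in the style of etsy/feature.

// A feature that is fully on:
$server_config['my_feature'] = array('enabled' => 'on');

// Ramped up to 1% of all users:
$server_config['my_feature'] = array('enabled' => 1);

// Bucketed by user, so a given logged-in user always gets the same answer:
$server_config['my_feature'] = array(
    'enabled'   => 1,
    'bucketing' => 'user',
);

// A multi-variant (A/B/n) feature for a hypothetical layout experiment:
$server_config['homepage_layout'] = array(
    'enabled' => array(
        'layout_1' => 1,
        'layout_2' => 3,
        'layout_3' => 3,
    ),
);

// In application code, the check is a simple conditional...
if (Feature::isEnabled('my_feature')) {
    // render the new thing
}

// ...or pass in the user object when the feature is bucketed by user:
if (Feature::isEnabledFor('my_feature', $user)) {
    // render the new thing for this user
}

// For multi-variant features, switch on the variant instead:
switch (Feature::variant('homepage_layout')) {
    case 'layout_1':
        // render layout 1
        break;
    case 'layout_2':
        // render layout 2
        break;
    case 'layout_3':
        // render layout 3
        break;
    default:
        // control: render the existing layout
        break;
}
```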
This all ties into continuous integration: we always want to keep the build green, and we want to be able to release our code at any time. When a developer is ready to commit code, we have a tool called try. Try basically takes a diff of your working copy, patches it onto whatever's on master, and runs the tests against it; try is also available on GitHub. The advantage of running try before you actually commit is that you can be sure that whatever you're working on won't break the build. It's a housekeeping task you do so that you're not affecting other people in the queue.

For actually deploying, we have a tool built in-house called Deployinator, also available on GitHub. This is what it looks like. We have an instance of Deployinator running for each of our different stacks: our web stack, which is primarily written in PHP, and a bunch of services written in Go, so there's a separate stack for each of those.

Once you deploy, some tests run; once the tests look green, you do some manual testing and then you're ready to release. Like I said before, it's push-button deploy: there are a couple of buttons there, and the green button is all you click when you want to deploy. There is of course a step before production that we call princess. It's a bunch of boxes depooled from production, so you get almost exactly the environment you would get on production, before you actually turn something on or off for all users.

This makes it very easy to push, and anybody can push. Anyone who starts at Etsy pushes something on their first day. It may be very simple, but it helps you get into the mindset of being comfortable pushing. I think it's important to understand that you still need to be a little anxious when you're pushing code to production: you need to be ready to test, and you need to realize that what you're about to push affects a lot of users. But you don't need to be afraid, because we have all these tools in place that allow you to push comfortably.

Typically, in a lot of organizations, the relationship between dev and ops is one where developers hand off code to ops, and ops goes off and pushes that code. There isn't the kind of communication where people feel comfortable with each other; it's more adversarial. With a setup like ours, it's easier to work with each other.

We also allow for the idea of dark changes. Dark changes are any changes you have an extremely high amount of confidence won't degrade production. Typical examples include simple template changes, some CSS tweaks, or unreferenced code. Unreferenced code is code that's, say, behind a feature flag that's turned off: while you're working on your feature and it's flagged off, you can keep working on it and push it without going through the whole cycle of pushing code. By convention we just mark it as dark, so whoever is pushing that code knows not to worry about it.

Along with the typical push cycles for deploying code to production, whether it's our web stack or services, we have config pushes. If a feature was enabled for 1% of all users and you want to push it to 50% of all users, you don't need to go through the whole web production push cycle; there's a separate Deployinator that just lets you change configuration.
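In code terms, ramping that feature from 1% to 50% is just a one-line configuration change, deployed through the config Deployinator rather than the full web push. A hedged sketch, again in the style of the Feature library and using the same flag as above:

```php
<?php
// Before the config push: my_feature is ramped to 1% of users.
$server_config['my_feature'] = array('enabled' => 1);

// After the config push: same flag, now at 50%. No application code has
// changed, so this can go out through the faster config deploy instead of
// the full web push train.
$server_config['my_feature'] = array('enabled' => 50);
```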
So we've talked about how we use all these tools, but how do we actually coordinate all of this? We do it using a push train, and we do it using very old-school coordination of people on IRC. What you see there in the channel topic is the push train. In this example there are two trains: the first train has five people and the second has one person. Any time I'm ready to push my code, I go to this channel, which is called the push channel, and I join.

Once I join, I get in line in the queue behind whoever is leading the train. After every cycle, say a push to princess, the pre-prod environment, or to prod, if my changes look good to me I say ".good", which tells the person leading the train that things are good and they can proceed to the next step. You can say it in multiple languages, too. If at this point someone else joins the queue and there's already a train in place, they get their own train, which they become the lead for. We do all of this using a bot called Pushbot; the topic grammar for Pushbot is written in ANTLR, and it's available on GitHub too.

Once you're done deploying, it's time to find out whether everything is looking okay and to gain confidence in whatever you built, and we do that using a bunch of tools. One of them is Supergrep. Supergrep is basically an aggregation of all the logs relevant to a particular stack, streamed together, so it's very easy to tell if a push you were involved in suddenly caused a spate of errors of a specific kind. Supergrep can get noisy, because it aggregates logs from all the different boxes in your stack, so we also have a tool called Supertop, which, like top, surfaces the most common errors popping up across all your boxes.

Of course, we have dashboards. There are sitewide dashboards that tell us about the various errors we see on different pages, so if, for example, one of your pushes causes an increase in 404s, you'll be able to tell relatively quickly. The vertical lines you see there indicate when a push actually happened, and the different colors indicate the different stacks for those pushes: one may be a web stack deploy, another a search stack deploy. There are also application-specific dashboards: whenever a team is building a product, they instrument dashboards for the specific part of the product they're working on, so it's easy to tell when any push is affecting their own systems. We do this using StatsD and Graphite.
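To give a concrete flavor of what instrumenting your own dashboards looks like: with StatsD and Graphite, the application-side instrumentation is usually a one-liner per metric. This is a rough sketch modeled on the PHP StatsD client examples Etsy has published; the metric names and the do_checkout() function are hypothetical, and the exact client API may differ.

```php
<?php
// Hypothetical example of instrumenting product-specific metrics with a
// StatsD client. The counters and timers end up in Graphite, where teams
// build the dashboards they watch after a push.

// Count an event every time it happens (e.g. a listing was favorited):
StatsD::increment('listings.favorited');

// Count failures separately, so a push that breaks this path shows up as
// an error-rate spike on the team's dashboard:
StatsD::increment('checkout.payment.failures');

// Record how long an operation took, in milliseconds:
$start = microtime(true);
do_checkout();  // hypothetical application function
StatsD::timing('checkout.duration_ms', (microtime(true) - $start) * 1000);
```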
So to summarize the flow: every time someone is ready to push to production, before they commit code to GitHub they're expected to run try, to make sure they're not going to block the push queue. Then they go onto IRC, into the push channel, and join the queue. They use Deployinator, and whoever is heading the queue pushes on behalf of everyone on that train. Once your change is out, you do some manual testing to make sure things look okay, after which you look at Supergrep and the dashboards to make sure things are stable. Then you're ready to leave the queue, and you're done deploying.

Does anyone know what this is? Yeah, it's an RJ45, an Ethernet cable. How many ways can you plug this in? There's only one. How many ways do you think you can plug in a USB cable? Also one, but it's very confusing.

When we think of the tools we're trying to build to allow for continuous deployment, I like to think of the metaphor of poka-yoke, which is Japanese for mistake-proofing. You want to make your tools such that it doesn't take a lot of mental overhead to understand what step you're in or what you need to do to get your code out to production. I think it's a very helpful metaphor.

We've talked a lot about the technology, but you can't have an environment where you can push this easily and quickly without having the culture to go along with it, and there are some basic ideas you can foster in your organization that help.

One of them is that you always want to assume best intentions. No one is ever intentionally trying to break production or break the code, so if you go into the room already assuming you want to blame someone for something that went wrong, you're doing it wrong, because they didn't choose to break it on purpose. You also want to cultivate empathy: you want to understand where the other person was coming from when things go wrong. And you want to be open, open to critique, open to making your systems better.

Failure is an option. A lot of people like to say and think that failure is not an option, but that's not reality. Systems always fail, however hard you try; failures happen, humans fail, and so we must acknowledge that failure is an option. It's not something we want, and we really don't want to fail, but we can acknowledge that we still will. And the great thing about failure is that it's also an opportunity, because it's the one chance we get to make our systems even stronger and more resilient.

We do that using post-mortems, and we go into these post-mortems blameless. The idea of a post-mortem is not to find out that an outage happened because of person X or person Y, but to understand all the various parameters and variables involved in what led to that particular outage. Root-cause analysis is pretty common in technology, but root-cause analysis by definition assumes there is a root cause, and often there isn't one single root cause for a failure. A post-mortem can help us understand the system in its full complexity and give us a better sense of the various levers we can pull to make it more resilient, whether those are technology, process, or humans. Post-mortems often lead to remediation, where we may file a bunch of tickets and figure out whether we need to make changes in our process or just in our tooling.

We also have a tool called Morgue, where we document the various failures we've had.
It's a pretty cool tool, because you can often learn things from looking at other failures, and everything is very extensively documented: the exact conversations that happened, the steps that were taken to figure out what happened, and the remediation tickets that were assigned to make things more resilient and stronger.

This is the three-armed sweater. We used to give out a three-armed sweater award for the most spectacular failure of the year. It's funny, but I think it's important to realize that the award is not given for the failure itself; it's given for a failure that led us to make our systems even stronger. There's a difference between celebrating failure and acknowledging failure in order to make things better, and it's a subtle one. We're not trying to glorify or celebrate failure for failure's sake.

We have a tool at Etsy called mixer; it's available on GitHub. Mixer sends out an email once every two weeks, to people who've opted in, that randomly pairs you with another person in the organization for a coffee, or a remote phone call, or something like that. The advantage of doing something like this is that different parts of the organization get to interact. I may not always know what it's like to be in customer support; I may not know what it's like for a seller or a buyer to ask questions of customer care. If I'm able to connect with that person, I can empathize with their job better than I would have without that opportunity. In the same vein, it's valuable for a product person to interact with a designer, or, looking at it the other way, for a support person to interact with an engineer and know what it's like to be in my shoes. It creates an environment where no one is anonymous and everyone is actually a person, and that makes a better working environment for everyone.

To summarize: we have a bunch of culture and tooling in place that allows us to deploy fairly often. Most often we succeed; sometimes we fail. When we fail, we conduct post-mortems, which lead to remediation items, and those remediation items feed back into our existing pipeline and make our systems stronger. In addition to this, you also need external stimulus; we can't have the same cycle go on without anything coming in from outside, and that's where learning from people outside, through conferences or meetups, really helps.

So we deploy fairly often. The various colors you see are the various deploy stacks we have. We deploy around 30 times a day at least, plus 30-some config pushes a day, so over 60 pushes every single day.

And that's it. There's a lot of information on the various tools and processes on the Code as Craft blog. Thank you. I'll take questions.

[Host] Thank you. Does anyone have any questions?

[Audience] In your continuous deployment, if you have a failure, do you patch quickly or roll back?

We roll back. Anyone else? No? I'll be around if you have questions. Thanks so much.

[Host] Before I release you for lunch, I have a few announce—