scaling Rails at Shopify. This is our quick story of how we survived Black Friday and Cyber Monday last year and in the past few years. My name is Christian, @cjoudrey on Twitter, and yeah, don't follow me, there's no point. I'm from Montreal. So three months a year, this is what my job looks like: cars are buried in snow, people push buses, and there are maple syrup heists. And there's poutine, which is probably the best reason to come to Montreal. So I work at Shopify. Shopify is a company that is trying to make commerce better for everyone. Our platform allows merchants to sell stuff on multiple channels. Primarily that's what we call the web channel, which is websites. We give our merchants the ability to customize the HTML and CSS of their websites, and they also have access to Liquid so they can fully customize the look and feel of their site. They also have a point of sale for physical stores, and a mobile app for people on the go who want to accept payments. Our stack is a pretty traditional Rails app. If you've seen John Duff's talk earlier, I'm probably going to repeat a lot of things, but: we use Nginx, Unicorn, Rails 4, and Ruby 2.1. So we're on the latest versions of everything, except Ruby, I guess. We use MySQL. We have around 100 app servers running in production, which accounts for roughly 2,000 Unicorn workers. We have 20 job servers with around 1,500 Resque workers. That's the kind of scale we're talking about. So since this is a talk about scaling, I need to throw big numbers at you, otherwise you just won't be impressed and this whole talk will be kind of useless. We have 150,000 merchants on Shopify, as of last night's check. These merchants account for around 400,000 requests per minute on average, but we've seen peaks of up to a million requests per minute during what we call flash sales. All told, we processed up to $4 billion in GMV last year. If you do the math, that's around $7,000 per minute. So any minute we're down, we're basically burning money, and someone somewhere is losing money. Because we're in the commerce industry, we have to deal with these really fun days called Black Friday and Cyber Monday. Black Friday is just crazy. Cyber Monday is usually easier: when Black Friday goes well, we can just kick back and relax for Cyber Monday, because it won't be any worse than Black Friday. So this kind of stuff happens in malls: people go crazy, they fight each other to get this TV and stuff. But it turns out Black Friday is pretty crazy on the internet too, who would have expected. Last year we saw 600,000 requests per minute, which is about two times our average traffic on a normal day. We also processed three times more money during those four days than on average. So it's a pretty big time of the year for us, and we just can't afford to be down. Everything has to go perfectly. To understand the decisions we made to scale Shopify, you have to understand that we use Unicorn, and each request ties up a Unicorn worker. So to scale Shopify, we need to either reduce response time or increase the number of workers we have. I'm going to go through the various techniques we've used to reduce response time, and hopefully you'll be able to take some of this and apply it to your own apps. Our first line of defense is what we call page caching.
The idea here starts from an observation: if, say, 10,000 people hit the same page at the same time, chances are we're going to respond with the same thing to all of them. So it's kind of crappy that we're doing all this computation 10,000 times, right? For 10,000 requests to the same page, it would be cool if we could do the computation once and serve the same data to everyone else. The problem is that some parts of the page differ: on this particular page there's the number of items in your cart, and on some pages people are logged in, so the page won't be exactly the same for everyone. So we wrote this gem called Cacheable, which is a generational caching system. What that means is that we don't have to manually bust the cache, because cache invalidation is the hardest thing to do in computer science, from what I read. That and naming things. And off-by-one errors. So the idea here is that we never manually bust the cache: the cache key in memcached is based on the data that you're actually caching. Let me go through a typical example of what Cacheable looks like. This is a posts controller with a very simple index action. We're scoping the posts per shop, because we're a multi-tenant app, and we're also paginating. You'll notice that we wrapped the action in this thing called response_cache; response_cache does all the nice magic. You'll also notice there's a method called cache_key_data. Whatever this method returns, we basically call to_s on it, take an MD5 hash of it, and that becomes the key in memcached. The value is whatever the block yields: the response. Here's an example of how we generate the cache key for a request. Say the shop ID is 1. The path is /posts, plus the format and params. We decided to put the page param in the cache key because you don't want the cache for page one to be the same as the cache for page two, right? And you'll notice there's this thing called the shop version. This is what makes it generational: every time a post is updated, created, or deleted, we increment this counter. Because the shop version is part of the cache key for everything that's cached, all of the cached data effectively goes away and we start populating new cache keys. Does that make sense? The other thing this library gives us is gzip support. When we cache the HTML in memcached, we gzip it right away. So when a request comes in and we find an entry in memcached for that cache key, we take whatever is in memcached and serve it directly to the browser. The nice benefit is that we're also saving on bandwidth, because we're sending gzipped data straight to the browser. On the subject of saving bandwidth, we also do ETags and 304 Not Modified. If the browser has cached the page, we don't have to send anything: we just say 304 Not Modified and it serves it up from the browser cache directly. So let me show you some numbers. This is what our graph looks like for cache hits versus misses: the blue line is cache hits and the purple line is misses. We get about a 60% hit rate on this page caching. That's huge. That's 60% of 400,000 requests per minute, which is absolutely crazy. These requests don't hit the database, they don't do any parsing or compiling of Liquid templates. They really just take the data from the cache and serve it directly to the browser.
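Here's a minimal sketch of the shape of this. It is not the actual Cacheable gem, just an illustration of the generational-key idea, and names like current_shop are stand-ins:

```ruby
require "digest/md5"

# A simplified sketch of the generational page cache described above,
# not the real Cacheable gem.
module ResponseCache
  def response_cache
    key = Digest::MD5.hexdigest(cache_key_data.to_s)

    if (gzipped = Rails.cache.read(key))
      # Cache hit: serve the pre-gzipped body straight to the browser.
      # (The real thing would also check the Accept-Encoding header.)
      response.headers["Content-Encoding"] = "gzip"
      render body: gzipped
    else
      yield # run the action as usual on a miss
      Rails.cache.write(key, ActiveSupport::Gzip.compress(response.body))
    end
  end

  private

  # current_shop is a stand-in for however the app resolves the tenant.
  def cache_key_data
    {
      shop_id: current_shop.id,
      # Incrementing this counter on any write invalidates every key at
      # once: old entries stop being referenced and fall out of memcached.
      shop_version: current_shop.version,
      path: request.path,
      page: params[:page],
    }
  end
end
```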
The problem with page caching is that when we have a sale, say some shop runs a massive flash sale where lots of people are buying at the same time, you'll notice on the graph that the hit rate goes down. That's because we're continually updating the inventory on the products being bought, which bumps the shop version. So effectively that shop isn't getting any page cache hits during a flash sale. But we still get 40% cache hits overall in that case, so it's still pretty good. Our second line of defense is query caching. We do around 60,000 queries per second, which is absolutely crazy, and we want to reduce the stress on the database. We have this thing called IdentityCache, which is a gem we open sourced. What it does is marshal Active Record objects and cache them directly in memcached, so that we don't have to hit MySQL when we use those records. The cache is opt-in by design: there's a method called fetch, and when you use fetch instead of find, you're loading through IdentityCache. The idea is that in mission-critical areas, like say the checkout process, you don't want to rely on the cache, because the cache can be wrong. You want to hit the database directly. That's why we made it opt-in by design. The caveat to IdentityCache is that, unlike generational caching, we have to manually bust the cache. So we have an after_commit hook: whenever a record, or an association of a record, is modified, we go and manually expire the keys in memcached. The problem is that there can be race conditions where you manage to write to the database but don't manage to clear the memcached keys, but it's something that doesn't happen very often, and we're okay with that trade-off. So what does IdentityCache look like? This is a very simple example: a Product model that includes IdentityCache. A product has many images, and you'll notice we're caching the has_many relationship and passing embed: true there. I'll explain what that means. Basically, instead of doing Product.find with the ID, we do fetch. This loads the data from the database if it's not cached, and when it does that, it saves it into IdentityCache after the fact. You'll also notice we're doing fetch_images. You can kind of see what's going on here, right? You replace find with fetch, and that's how you use IdentityCache. The cool thing with embedding is that these two calls result in one memcached call, because the images are embedded within the same record. Does that make sense? It's pretty cool, right? We're trading two MySQL queries for one memcached call, which is really good in the grand scheme of things. IdentityCache also lets us define secondary indexes. You don't always want to find a product by ID, right? In our case we use handles: your product lives at /products/<handle>. So IdentityCache lets you define secondary indexes, so you can load a product, in our case, by shop ID and handle. Let's look at some graphs again. This is the cache hits and misses for IdentityCache. Comparatively, this is pretty crazy: the blue line is cache hits, and every time there's a cache hit, we're saving a call to MySQL. And during a flash sale, there's no dip. Because during a flash sale, like I mentioned, all we're really doing is updating an inventory counter, a single update on a single product. It's such a small thing in the grand scheme of things that there's no dip at all.
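This is roughly what that model looks like. The declarations follow IdentityCache's README, though exact option names may vary by version:

```ruby
class Product < ActiveRecord::Base
  include IdentityCache

  has_many :images

  # Cache the association inside the product's own memcached blob, so
  # fetching a product plus its images is a single memcached round trip.
  cache_has_many :images, embed: true

  # Secondary index: look products up by shop and handle without MySQL.
  cache_index :shop_id, :handle, unique: true
end

# Reads go through memcached; a miss falls through to MySQL and
# populates the cache after the fact.
product = Product.fetch(params[:id])
images  = product.fetch_images # served from the same embedded blob

# For URLs like /products/<handle>:
product = Product.fetch_by_shop_id_and_handle(shop.id, params[:handle])
```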
So those are two strategies. The third one is backgrounding things. Because we do commerce, we have to deal with payment gateways. I'm not sure if you've dealt with payment gateways before, but the 95th percentile response time is about five seconds. If our Unicorn workers had to wait five seconds during a sale, we'd just be down; there wouldn't be anything left to serve requests. So we background these kinds of things. We background a lot of things: webhooks, email sending, payment processing jobs, fraud analysis. Basically anything that doesn't have to be done within the request gets backgrounded, so that we can release the Unicorn worker as soon as possible and keep processing other requests. A nice benefit of doing this is that, depending on how you set up your queues, you can do throttling with background jobs. You can say: only allocate a maximum number of workers to a specific queue, and then you know only that many jobs will be worked at the same time. So now what? We have all this in place to handle 600,000 requests per minute, right? The thing is, regressions happen, and the best way to know when a regression happens is to measure things. At Shopify we just measure all the things. We have thousands and thousands of graphs and metrics. The way we do this is with StatsD. If you haven't used StatsD before, it's basically a server that you throw numbers at, and it aggregates those numbers and gives you 95th percentiles, minimums, maximums, counts, you name it. With this data, you can then plot it on different backends. We have a gem that makes it a lot easier for us to instrument our code. It's called statsd-instrument, and this is an example of how we use it. We have this class, the Liquid template. We extend it with the module, and then we call statsd_measure on it. What statsd_measure does is measure the amount of time it takes to call the render method, and save that metric under the Liquid template render stat. What that gives us in the end is that we can plot graphs of the 90th percentile of the Liquid template render method, which is pretty cool. The gem also gives us statsd_count, so you can count the number of times things are called. In our case, we count the number of times perform is called on the payment processing job, which gives us the number of payment processing jobs that we run. So this is all fun, but what is it good for? We use this service called Datadog, which is one of the backends for StatsD, and we plot all this data on our dashboards. This is actually our health dashboard: at a glance, we can see if Shopify is doing well or not, and we can identify regressions pretty quickly. Another cool thing about Datadog is that it does alerts. I was looking for a screenshot and found this one: one of our ops people set up an alert for whenever the temperature of the ops room goes above 24 degrees Celsius, and it fires off these alarms, which I find pretty funny. But you can get clever and build really useful alerts with Datadog.
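Here's roughly what that instrumentation looks like with the statsd-instrument gem; the metric names are made up for the example:

```ruby
require "statsd-instrument"

# Measure how long Liquid template rendering takes. Every call to
# #render reports its duration, so you can graph the 90th percentile.
Liquid::Template.extend StatsD::Instrument
Liquid::Template.statsd_measure :render, "liquid.template.render"

class PaymentProcessingJob
  def perform
    # ... talk to the payment gateway, update the order, etc. ...
  end
end

# Count how many times #perform is called, i.e. how many payment
# processing jobs we run.
PaymentProcessingJob.extend StatsD::Instrument
PaymentProcessingJob.statsd_count :perform, "payment_processing_job.perform"
```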
That all sounds fun and perfect, but it's not perfect. Even with all of this in place, regressions can still happen, and sometimes you don't find out until it's too late. We don't want that to happen, so we load test. A lot. We have this tool called Genghis. Basically, it simulates Black Friday and Cyber Monday. Sounds pretty crazy, but it's actually really simple: it's just a tool that simulates a person going through the checkout process and buying something, and it does that thousands of times concurrently, for many, many minutes, and we see what happens. We're basically DDoSing Shopify in production. If it's going to break, it might as well break before Black Friday. We do this several times a week. It helps us with capacity planning, and it assures us that when Black Friday does happen, we're going to be totally fine, at least for the things we control. How many of you use MySQL? Wow, okay, cool, because we were expecting a lot of Postgres or something. We use MySQL at Shopify. One thing that happens sometimes is slow queries, right? And MySQL gives us a really nice tool called the slow query log. It logs them to a file. Which is sort of useful. It actually becomes useful when you can figure out what causes the slow queries. So I want to go through the three steps we use to determine the root cause of a slow query, because I find it pretty interesting and I figured it would be useful for others to know. Here we go. Step one: if you use Nginx, there's a module called nginx_requestid (I put the link on the slide). What it does is expose a variable in your Nginx config that you can pass along as a header. It's just a unique ID for that specific request. That alone doesn't help us, though. Step two: there's this thing called log_process_action in Rails, which allows you to add stuff to the last log line of a request. You know how it says "Completed 200 OK in..."? You can add things there, so we add the request ID. So we're getting there. Step three: we use Marginalia, which is a gem from Basecamp. Out of the box, it adds the name of the controller and the action that performed the query as a comment on the query, and we also add the request ID there. This is pretty great, because once all that's done, our slow query log looks like this: we can see exactly which request, all the way back to Nginx, caused the slow query, which makes it much easier to debug the root cause. We have this nice Nginx-to-Rails-to-slow-query correlation. And there's a bonus, too: we add the request ID whenever we queue a background job. That lets us know which request queued a job, because sometimes that's interesting to know when you're debugging something.
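A rough sketch of steps two and three. The header name and the Marginalia wiring here are illustrative; check the nginx_requestid and Marginalia READMEs for the exact incantations:

```ruby
# In Nginx, the request id module exposes a variable you can forward:
#   proxy_set_header X-Request-Id $request_id;

class ApplicationController < ActionController::Base
  # Step two: stash the request id into the instrumentation payload...
  def append_info_to_payload(payload)
    super
    payload[:request_id] = request.headers["X-Request-Id"]
  end

  # ...and append it to the "Completed 200 OK in ..." log line.
  def self.log_process_action(payload)
    messages = super
    messages << "request_id=#{payload[:request_id]}" if payload[:request_id]
    messages
  end
end

# Step three: teach Marginalia to add the request id to every SQL
# comment, alongside the controller and action it adds by default.
module Marginalia
  module Comment
    def self.request_id
      marginalia_controller.try(:request).try(:headers).try(:[], "X-Request-Id")
    end
  end
end
Marginalia::Comment.components = [:controller, :action, :request_id]
```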
So the next thing I want to talk about is resiliency. Anybody know what that means? No? Or maybe you're shy. I'm going to read a quote: "A resilient system is one that functions with one or more components being unavailable or unacceptably slow." Does that make sense? So here's what happens. You start building a Rails app, and you're having a really good time. Then you need sessions, because you want to remember whether someone's logged in or not. So you add a session store. You continue coding away, and now you need background jobs, so you add Redis. Then you add memcached. Then your users want to be able to search, for whatever reason, so you add Elasticsearch. And the next thing you know, someone calls you up with a screenshot of the famous 500 error page and you're like, oh god, and the person on the phone is pissed off because they can't get to your site. What went wrong? What went wrong is that you just assumed that these services work. I mean, Redis doesn't go down, right? You did sudo apt-get install redis, it's on the same machine, it shouldn't go down. So you assume that things are always up and always fast, but in reality that's not the case. Basically: don't let minor dependencies take you down. You don't want something like the session store to take your whole app down, because really the only thing you need the session store for is to know whether the customer is logged in or not. You probably have code in a before filter that checks if there's a session, i.e. it loads a customer. The problem with that code is that if the session store is down, the before filter just explodes, for every single request. And that's bad. So what can we do here? Well, we can rescue the data store's "unavailable" error. It works. It's probably not at the right level of abstraction, but it's something we should always do: we should never take for granted that the session store will be up. And you should do this for every data store, except, I guess, your database, because if your database is down, your whole app is down anyway. Sprinkling these rescues through your code base helps, but if you don't have tests to ensure that these flows actually work without these data stores, someone can come around, remove the rescue thinking it's useless, and then you're back to square one, where your app goes down. So we built this tool called Toxiproxy. It's a very simple TCP proxy, and it's not Rails-specific; it's written in Go. It's a proxy that you put between your Rails app and your services, and what Toxiproxy does is let you simulate a service being down, or, even worse, a service being slow. Because if the service is down, you get the response right away: connections fail. But if the service is slow, well, that's another story; it's just slow. The cool thing about Toxiproxy is that we released a Ruby library that lets you control it. We have Toxiproxy between our Rails app and our minor dependencies in the development and test environments, and what that allows us to do is write tests to assert that, for instance, when the session store is down, a request to / still responds successfully. Now we're absolutely sure that this flow works, even if the session store is down. There's a really nice blog post, I'll post the slides after the talk with the link, that describes the process of making Shopify resilient. I'd encourage everybody to read it, but the TL;DR is that we did what I just described for all the minor dependencies we have. We ended up with this nice table: here's the Shopify checkout, here's the Shopify web channel, and here are all the services each depends on. And we make sure that whenever one of those services is down, there's a proper fallback so that we don't render 500s; we try to fall back smartly and serve 200s. Because what's worse for the user: seeing a 500, or seeing that they're logged out temporarily? Being logged out temporarily is obviously better, right?
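Here's the shape of such a test with the toxiproxy gem. The proxy name is hypothetical; the down and latency helpers follow the gem's README:

```ruby
require "test_helper"

class SessionStoreResiliencyTest < ActionDispatch::IntegrationTest
  test "storefront renders when the session store is down" do
    # Toxiproxy cuts the connection for the duration of the block.
    Toxiproxy[:session_store].down do
      get "/"
      assert_response :success
    end
  end

  test "storefront renders when the session store is slow" do
    # Worse than down: add a full second of latency downstream.
    Toxiproxy[:session_store].downstream(:latency, latency: 1000).apply do
      get "/"
      assert_response :success
    end
  end
end
```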
So, I mentioned slow resources. This is a tough one. We have three shards: we split our data across three MySQL databases. For those of you who don't know what sharding is: we have the data for, say, shop one, shop two, shop three on shard one, which is one MySQL database, and the same thing for shard two and shard three. So we split our shops across three shards, and we put Rails in front of that. Whenever a request comes in, we use the host name to determine which shard that shop is on, and we query that database. Sounds really cool, right? But there's a problem. What happens if shard one is slow? The same Rails app is serving all three shards. If shard one is slow, your Unicorn workers start responding slower, and at some point they can't take any more connections. Doesn't that kind of defeat the purpose of sharding? Isn't the point of sharding to be able to peel off one shard and still serve traffic for the other two? So we thought about this: how can we make it so that shard one being slow doesn't affect shard two and shard three? How can we fail fast? We have this gem called Semian, which is a smart circuit breaker. I'll show some code so it makes more sense. We register shard one as a resource and say there are five tickets, meaning you can run five queries against shard one at a time. There's a timeout: if a sixth query comes in, we wait 0.5 seconds for a ticket to free up. If it doesn't free up, we just pretend that MySQL shard one isn't there. That's our way of failing fast. So if there's a slow query causing shard one to respond slowly, we'll only respond slowly to five requests, and the other requests will fail right away. There are a couple of other settings here. This is our error threshold: the idea is that after enough errors, we just pretend shard one is down, so we open the circuit. After 10 seconds, we put the circuit into a half-open state, where we let a bit of traffic through. If we see that shard one is healthy again, we close the circuit; if it's not, we reopen it. The idea is to reduce the impact of one database being slow on the rest of the connections. The way you use this in Rails is you acquire the resource and do your query within that block. So going back to our example: now if shard one is slow, Semian will cut it off, and we can still serve traffic for shard two and shard three successfully. So what else can go wrong? So many things can go wrong. These are all the things we depend on: shipping rate providers like FedEx and UPS, payment gateways like Stripe and PayPal, fulfillment services, internal services. During Black Friday, all of these services get thrown the same surge of traffic as Shopify. So even if Shopify can scale, and our internal services can scale, we're still at the mercy of, say, FedEx for calculating shipping rates. For these we have manual circuit breakers. They're basically just flags: we wrap things in if statements, and we can manually go into a panel and disable a specific service. So let's say PayPal is having a hard time during Black Friday: we go into the panel, disable PayPal, and Shopify keeps working for everybody who doesn't use PayPal.
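Here's roughly what that Semian setup looks like, following the gem's README; the resource name, the exact numbers, and shard_one_connection are illustrative:

```ruby
require "semian"

Semian.register(
  :shard_one,
  tickets: 5,           # at most five in-flight queries to this shard
  timeout: 0.5,         # wait up to half a second for a free ticket
  error_threshold: 3,   # open the circuit after this many errors...
  error_timeout: 10,    # ...and keep it open for ten seconds
  success_threshold: 2  # successes needed while half-open to close it
)

# Wrap queries against the shard; if the circuit is open or no ticket
# frees up in time, this raises immediately instead of tying up workers.
Semian[:shard_one].acquire do
  shard_one_connection.execute("SELECT ...")
end
```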
That's all I had. Any questions?

Q: Is Genghis predictable? Do you know when it's coming, or are its attacks unexpected?

So this is pretty cool. Is Genghis predictable? Yes. We have a Google Calendar event that says: do a Genghis run at this time. So that part is predictable. But the cool thing, not about Genghis specifically, is, have you heard of Chaos Monkey before? It's a thing Netflix does: they basically throw a monkey into the data center that randomly pulls plugs and wreaks havoc. We started reusing the Genghis flows for this too, so we get it for free. So we have the predictable runs, and we also have our own chaos monkey that wreaks havoc, but at a lower intensity. Our Genghis runs are really "here's what we expect for next year's Black Friday", and we run them with our mouse on the stop button in case anything goes wrong. In most cases we ramp up gradually, so we've never actually taken Shopify down while load testing in production.

Q: When you put payment processing in a background job, how does the user know the payment went through?

So normally in a checkout flow, you enter your credit card information, hit submit, and the next page you see is the thank-you page. In our case we had to add a page that says "please wait". What I showed you was the 95th percentile; on average a payment gateway takes about a second. So on average the user only stays on this page for about a second, and we just poll the page. The payment processing job sets a flag on our order model, something like payment_successful, and once that flag is there, we send the person to the receipt page. If there's an error, we send them back to the checkout. So yeah, you need to add a spinner page, which is not ideal, but if you use Ajax you can make the experience better, say by making it a spinner on the button or something.

Q: Do you cache any of the external dependencies?

Yeah, that's a good point. One that's a bit obvious is shipping rates. We cache shipping rates, because if you're shipping from point A to point B with a given cart, the prices will always be the same, right? At least for a certain period. So we cache shipping rates for, it's not six days, it's six hours, and it's just a memcached cache; the key is basically the addresses and the cart contents.
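A minimal sketch of that please-wait flow from the payments question; all the names here (payment_successful?, please_wait, and so on) are hypothetical, not Shopify's actual code:

```ruby
class CheckoutStatusController < ApplicationController
  # The "please wait" page polls this action until the background
  # payment job has set a flag on the order.
  def show
    order = current_shop.orders.find(params[:id])

    if order.payment_successful?
      redirect_to receipt_path(order)
    elsif order.payment_failed?
      redirect_to checkout_path(order), alert: "Payment failed, please try again."
    else
      render :please_wait # page refreshes (or polls via Ajax) and tries again
    end
  end
end
```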
Q: It's always hard to reason about minor dependencies, because none of them really feel minor when you're in the middle of it. The shipping rate providers are the perfect example: if you can't calculate rates, you can't really complete the cart. Are there other ways to respond to some of these dependencies?

So actually, I'd consider those major dependencies. The minor dependencies are, honestly, things like the session store. And was there a real use case at Shopify? Yes: we had this before filter that tried to load your customer ID from the session store, and when the session store went down, all of Shopify went down. That's the one you don't think about, right? You just assume it works. For the shipping rate providers, think about it this way: if FedEx is down, we can't provide any rates to people, at all. People can't check out. That's pretty bad, right? So what we do for that is, we have so much data in our databases that we try to be smart about it and estimate the shipping rate. We can look at: did anybody ship this exact cart from this state to that state yesterday? We can approximate the amount and use that. But again, we're really at the mercy of these external services, and there's really nothing we can do besides providing a fallback.

Q: And if you guess wrong, you just eat the cost of it?

The merchant would, I guess, yeah. I mean, you can be smart about it and maybe add a dollar or something, I don't know. Yeah, it's tough. What we did for last Black Friday was really very simple: we just wrapped everything with the circuit breaker, and when you open the circuit, the rates go away. The interesting part is that some of our merchants have multiple shipping providers, so even if we kill off FedEx, they can still supply some rates. Same thing for payment gateways: a lot of our merchants accept two payment gateways, typically a credit card gateway like Stripe plus PayPal. So if PayPal had issues, customers would still be able to pay with a credit card. But yeah, the rates do go away.

Q: [inaudible question about noticing when those rescues fire]

Yeah, okay, let me go back to this slide. This would be a nice place to put StatsD. You can record this with StatsD and then have alerts that say, oh, the session store is having trouble. This is a very simple example, but I could totally see putting a StatsD counter here and having an alert that says: if there are more than ten errors on sessions within a minute, fire off an alert, and someone can look into it. You probably shouldn't do just the rescue alone, because that's rescuing with your eyes closed. Any other questions?
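To make that last suggestion concrete, here's a sketch of the counter-plus-rescue pattern; Datastore::Unavailable and the metric name are hypothetical:

```ruby
# Instead of rescuing with your eyes closed, count every failure so an
# alert can fire when the session store starts misbehaving.
def current_customer
  @current_customer ||= Customer.find_by(id: session[:customer_id])
rescue Datastore::Unavailable # hypothetical error raised by the session store
  StatsD.increment("session_store.unavailable")
  nil # fall back to treating the visitor as logged out
end
```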