Thank you for coming to my talk. That's very kind and generous of you, to listen to me talk about things. My talk is called Don't Forget the Network: Your App Is Slower Than You Think. I'm going to talk about things that you probably haven't thought about yet: how people use your application, and the ways that people using your application are having a worse time than you think they are. I'm sorry. I don't really know of any good way to talk about this except by probably making you feel bad for your users. So brace yourselves and you'll be fine. Before I get to that, let me introduce myself. My name is André Arko. I'm @indirect on almost all the things. That is an avatar of me that, now that I'm looking at it, is one avatar old. I'm sorry. I'll get it fixed by the time I post the slides on Speaker Deck. I co-authored the third edition of The Ruby Way. It's actually pretty great. I learned Ruby from the very first edition of The Ruby Way, and it was my favorite book, except that I couldn't tell anyone to use it because it was about Ruby 1.8. So I updated it, and it covers Ruby 2.2 and 2.3, and if you buy it, in a couple of years you can use it to prop up your monitor and make it higher, like I do with my copy of The Ruby Way, second edition. I work at Cloud City Development. We do mobile and web application development from scratch, but mostly what I do is join teams that need someone really senior to help with their Rails app or their front-end app. I've done a lot of Ember stuff. And I guess if listening to this talk makes you feel like you could use someone to help you feel less bad, talk to me later. That is literally my job. I also work on something else you may have heard of, called Bundler. I've worked on Bundler for a really long time, and it's been a really great experience to work on open source and to interact with every aspect of the Ruby community.
People do things with Bundler that I would never in a million years have imagined people do with Ruby, and then I get to help them try to solve their problems. And we've put a lot of effort into making it easier, I don't know about easy, but easier to get started contributing to open source through Bundler than a lot of other open source projects. If you're interested in contributing to open source, definitely talk to me later or tweet at me, and I would love to help you start contributing to open source. The last thing I spend time on is called Ruby Together. Oh, I'm even wearing the shirt. Ruby Together is a non-profit trade association for Ruby people and companies that pays developers to work on Bundler and RubyGems so that you all can run bundle install and it actually works. Without companies and people giving us money, RubyGems.org just wouldn't stay up, and you wouldn't be able to bundle install, because we have to work on it every week to keep it up. It's servers. It's software. It all breaks all the time, and the only reason we're able to keep it working, now that there are so many people using Ruby and RubyGems, is because companies like Stripe and Basecamp and New Relic and Airbnb are willing to give us money so that we can pay developers to make sure it all works. We haven't let RubyGems.org go down in the last year, which is super great, but at the rate usage is going up, we need more people to give us money. If you are a manager, or if you can talk to your manager about Ruby Together, that would be awesome. So: the network, and how your app is slower than you think. Routing is a thing that your app has, even if you didn't think it does. At one point there was a very widely shared article on Rap Genius's blog about how Heroku's router was a sham and everything was awful.
Unfortunately, whether you're on Heroku or not, your app has a router, and it's probably making things worse than you think they are. So let's talk about how that is, why that is, and what you can do about it. By routing, I mean the part of your application's infrastructure that takes the request from the outside world and load balances it or forwards it or somehow gets it through your infrastructure until it finally reaches your Rails app server. Then your Rails app server does some stuff and tells New Relic, hey, this took 45 milliseconds, and then it has to go back through Nginx or HAProxy, or Nginx and HAProxy, or whatever it is that you use, back out to the internet, and then across the entire outside internet back to the user who was trying to find that thing out in the first place. So how exactly does this work? Maybe you haven't thought about this, and I totally don't blame you. On your laptop, this is a non-issue, right? In development, this is routing: you talk to your app. It's great, actually. Unfortunately, in production you need more than one app server, and people are coming from a lot of different places. So this is just a generic Rails app. Not every Rails app will look like this, but almost every Rails app looks like this. You have some outside-level load balancer, and you have some inside level that splits requests up across all of the Unicorns or all of the Pumas or all of the whatevers. And every single one of those lines adds time to what your users see that you never saw while you were working on the program on your laptop. So, question time. Raise your hand if you know how long your routing layer takes. I've asked this question in various talks about eight times. I totally expected no one to raise their hand. I've literally had one person ever raise their hand. Eight talks, that's probably closing in on a thousand people now.
I once asked this question at a DevOps conference and zero people raised their hands. I don't expect you to know the answer to this question. But it's actually a really important question to ask, because your end users' experience is 100% directly impacted by this. Someone who goes to your production app and tries to use it experiences 100% of your routing layer twice for every request they make. And is it a long time? Who knows? None of us. And then on top of that, not only is there this question of how long it takes, in the perfect case, from the time they make the request to the time your app is processing it, and then from the time your app stops processing it to the time they get the response: none of that time shows up in your nice New Relic graph of how long this took. Zero of those milliseconds are included in that number. So you can look at the number and be like, yeah, we answer all our requests in, I don't know, what's a good Rails number? 250 milliseconds. I feel like that's a pretty common one. But how much time do you need to add to that before you know how much your users are actually experiencing? How do you even find out? And then once you find out, what if too many requests come in at exactly the same time? Just having that routing layer where all of your requests come to one point and then fan out across other points: this random assignment was the main point of that Rap Genius article about Heroku, and honestly, there's nothing else you can really do that makes sense. You just kind of randomly assign them. Like, here's one for you, here's one for you, here's one for you. And the problem is that almost all Rails apps have some requests that take 10 milliseconds and some requests that take a second and a half.
And when you're just throwing them out at random to every server that could possibly service them, unfortunately, statistically, it is very likely that you will end up with two horribly slow requests stacked up behind each other, and then the really fast requests start to stack up behind those. And it isn't very long before you see a 30-second timeout and you're like, that makes no sense. New Relic says that request takes 10 milliseconds. Why would it hit a 30-second Heroku timeout? So it's not perfect, but you can at least start to get a little bit of visibility into this using a New Relic feature called queue tracking, where you have your load balancer set a header that says, I got this request at this exact time. And then your app server says, well, I didn't get this request until this much later time. And then New Relic can add a thing to your graph that says, your requests are spending about this much time just sitting around waiting for a server to have availability to answer them. And that can be a completely separate thing that people don't measure, which is sometimes adding 50%, I've seen that, to the total time the user spends waiting on a request. And it wasn't even measured. No one knew it was happening. Everyone was just like, that's weird, it seems to take a lot longer to get a response than New Relic says it takes to make the response. I wonder why. So ultimately what I'm trying to impress on all of you is that the overall request time is not the number that Skylight or New Relic or, pick a service, I don't really care, tells you your request takes. That's a good number. Measure that number, pay attention to that number. If that number changes a lot, you want to know why it changed, because that's really important. But don't think that that number means that's how long people are waiting to get the results of your app running. It's not. It's not the time that you measure that your app takes to run.
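To make the queue-tracking idea concrete, here is a minimal sketch of the app-server side as a Rack middleware. It assumes your load balancer sets an X-Request-Start header in the "t=<microseconds since epoch>" style that New Relic's request queueing support understands; the env key it writes, and where you report the number, are up to you.

```ruby
# Sketch: compute time spent queued between the load balancer and the app.
# Assumes the balancer sets X-Request-Start as "t=<microseconds since epoch>".
class QueueTimeMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    header = env["HTTP_X_REQUEST_START"]
    if header && header =~ /t=(\d+)/
      started_at = Regexp.last_match(1).to_i / 1_000_000.0
      queue_ms = ((Time.now.to_f - started_at) * 1000).round(1)
      env["app.queue_time_ms"] = queue_ms # report this to your metrics service
    end
    @app.call(env)
  end
end
```

The important property is that the timestamp comes from the load balancer, not the app, so the measured gap is exactly the time the request spent waiting for a worker.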
And honestly, even that queue tracking I was talking about with New Relic requires that the clocks on the load balancer and the Ruby app server be synchronized so precisely that they can measure milliseconds accurately. And it's very easy to end up with clocks that are milliseconds off, and then your measurements are off. So what you want instead is a holistic measure of how long it actually takes to be a person on the internet, to say, hey, Rails app, I want to know a thing, and then for the Rails app to say, okay, here's your thing, and for it to arrive back. The strategy I have had that is really successful here is to deliberately create a Rails controller that returns an empty string, and then set up a service like Runscope or ThousandEyes or even Pingdom. There are services whose entire reason for existence is so that you can make requests to your own stuff from all over the world and find out how much delay your overall infrastructure adds to your application. And if you have a Rails app that returns an empty string... honestly, you could even do a Rack middleware that returns an empty string, because New Relic already measures the Rails framework overhead, right? You just want to know about all of the time up to the moment a request hits your Ruby app, and all of the time after it comes out of your Ruby app. And you can use one of these monitoring services as, like, the weather report for your users around the world. Honestly, I've worked at companies where 60% of their traffic was the U.S., but for no particularly apparent reason, 35% of their traffic was from Brazil. And then you really care a lot about network conditions changing and meaning that traffic to Brazil got a lot slower today. You should figure out why that happened and maybe think about setting up a CDN in Brazil.
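The empty-response endpoint can be as small as a bare Rack app, which keeps application work close to zero so external probes measure mostly network and infrastructure time. This is a sketch; the name PingApp and the /ping path are made up, and you would mount it wherever suits your routes.

```ruby
# Sketch: an endpoint that does as close to nothing as possible, so
# monitoring services measure your infrastructure rather than your app.
PingApp = lambda do |env|
  [200, { "Content-Type" => "text/plain", "Content-Length" => "0" }, [""]]
end

# In a Rails app's config/routes.rb, something like:
#   mount PingApp => "/ping"
```

Point Pingdom, Runscope, or a similar service at it from several regions, and the response times you get back are your routing layer's weather report.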
Because if your traffic numbers are relevant to your business making money, and they almost always are, this matters a huge amount. And right now, chances are good that nobody has any idea what they are. Are they bad? We don't know. Are they great? We don't know that either. Maybe they're great. Honestly, if all of you go home today and start monitoring these numbers and they're fantastic, I will be extremely happy for you. Based on past experience, unfortunately, they're probably not going to be that great. But knowing what they are is way better than having no idea that they exist. Very closely related to things taking longer than you think they do: let's talk about servers. I'm assuming that if you have things deployed, you have servers. This seems like a good bet. Let's talk about what's happening on your servers. Stuff, right? You buy them and you rack them, or you rent a fraction of one, or, I don't know, you rent a fraction of a fraction of a virtual machine that is a fraction of a physical machine. It happens. You end up with a piece of a computer, and some stuff is happening on that computer. And even if you bought the computer yourself and racked it yourself, it's still running a ton of stuff, and you have no idea what that stuff is. I'm not going to tell you that you need to know what all of that stuff is, but I am going to tell you that it's really important to know how that stuff is impacting the thing you do care about, which is your users' experience. A big thing that impacts this: whether you use Ruby or Python or Node or Go, you have a runtime for your application. Even Go has a garbage collector and a framework that all Go programs run inside. And what that means is that your application sometimes isn't running while your program is running. And when that happens, your code isn't running, your instrumentation isn't running, and you have no idea how long that took.
So if the garbage collector runs and your entire application just stops for a while, how do you know that happened? How do you know how long it took? You can't, right? It's really hard to write code that measures time during which your code wasn't allowed to run. Based on real-world usage, Go and Java and Ruby definitely all have garbage collection pauses, where execution of your code just... nope, hang on, wait, gotta collect some garbage. Okay, that's done, you can keep going. And Ruby has a thing called GC::Profiler that at least reports after the fact how long garbage collection took, which is awesome. But there are more reasons than just garbage collection that your code could end up paused. So what you actually want is some way to say: I can tell that my code stopped running. It was still working, but it stopped running for a second, and then it started running again, and how long was that? I learned this trick from Larry Marburger, who works at Papertrail, and I think he got it from some of his colleagues there. It's super clever. What you do is you start a new thread with Thread.new. This is Ruby-specific, but you can do this in any language. You start a new thread, and then you say: what is the time? Sleep one second. What is the time now? Then you subtract them and send the difference off as a metric. And if your code stops running, sleep 1 will take longer than one second. Little-known fact. So by monitoring how much wall clock time passes while a thread in your application is calling sleep 1, you can accurately graph how much overhead the surrounding interpreter is adding to your overall execution time. And I have definitely seen this happen, where you're running a Ruby program and you're like, that's weird, it seems kind of slow. And then you check the how-long-does-it-take-to-sleep-for-a-second graph.
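The sleep-1 trick can be sketched in a few lines. The `report` callable here is an assumption; in practice you would wire it to whatever metrics client you already use.

```ruby
# Sketch of the interpreter-lag trick: a background thread sleeps for one
# second, measures how much wall-clock time actually passed, and reports the
# overage. `report` is a stand-in for your real metrics client.
def start_lag_monitor(report)
  Thread.new do
    loop do
      before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      sleep 1
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - before
      # Anything beyond 1000ms is time the interpreter spent not running us:
      # GC pauses, VM co-tenant contention, and so on.
      report.call((elapsed - 1.0) * 1000.0)
    end
  end
end
```

Using a monotonic clock matters here: it keeps the measurement immune to NTP adjustments, which would otherwise show up as phantom lag.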
And you're like, oh my God, we're spending 150 milliseconds of every second doing something that's not running my program. Sometimes that means you have a memory leak. Sometimes that means the machine just got into a really bad, weird state. But at least then you know, and at least then you know that it's exactly that app server that's having the problem and all the other app servers are fine. Super useful. Very closely related to this, now that we're talking about interpreter lag, is the virtual machine you're running on. You're probably also running that interpreter inside a virtual computer. Amazon, DigitalOcean, Heroku, Engine Yard, OpenStack: you're either running on a VM, or you're running on a VM inside of a VM, or maybe, if you use Docker, a VM inside of a VM inside of a VM. Hooray. And as you can imagine, this is yet another way to have weird times where your code doesn't run and you don't actually know it, because your code literally couldn't run. It's even worse than that, because sometimes you'll end up with resource-specific contention. What if you're on a VM and one of your co-tenants is running a memcached server, so all the memory I/O is going to your co-tenant? How do you even know if that's a problem? What if they're doing something really storage-heavy, and that means you can't get disk I/O anymore? At a minimum, the resources you may care about are CPU, memory, disk, and network I/O. And when you get a shiny new empty VM, you don't know: maybe everything is great, or maybe that machine has basically no network I/O available, or basically no memory I/O available, because of co-tenants you don't know exist. Netflix has a really clever way to check for this.
They've written about it at some length. What they do is spin up a new EC2 instance, and then, before deploying to it, they shove a giant pile of benchmark suites onto it and run them, and then they compare the results to what they've decided is acceptable performance for that price point on EC2. If it's below their acceptable benchmarks, they throw away that VM, get a new VM, try the benchmark suite again, throw that one away, get a new one, and eventually they hit an instance that meets their criteria. They said in their paper that they have observed almost an order of magnitude of difference in performance at the same price point, because Amazon sells both the newest generation of hardware and hardware that is two or three or sometimes even four generations old as the "same" VM, very large air quotes around "same". And then you have to deal with co-tenancy issues, where you may have a VM on a very old, very heavily contended physical machine, or a VM on a brand new, uncontended machine. Netflix said that by doing this, and I'm probably going to get these exact numbers wrong, I'm sorry, it's been a while since I looked at that paper, they saved something like a third of their overall server costs, just by doing this benchmarking and only accepting VMs that met their minimum criteria. They have a roughly static amount of traffic they need to serve, but they got machines that were more capable of serving it at the same price point, so they needed to spin up fewer machines and pay Amazon less money. Now, you're probably not Netflix, so this probably doesn't matter to you that much, but it is at least something you can be aware of when you're like, man, 10 servers seemed to be enough to serve this traffic last week, right?
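The provision-benchmark-reject loop is simple to express. This is a hedged sketch, not Netflix's code: `provision_instance`, `run_benchmark`, and `terminate` are hypothetical helpers standing in for your cloud provider's API and whatever benchmark suite matters for your workload.

```ruby
# Sketch of the provision-benchmark-reject loop described above.
# provision_instance, run_benchmark, and terminate are hypothetical helpers
# wrapping your cloud provider's API.
def acquire_acceptable_instance(min_score, max_attempts: 10)
  max_attempts.times do
    instance = provision_instance
    score = run_benchmark(instance)
    return instance if score >= min_score
    terminate(instance) # underperforming for the price point; try again
  end
  raise "no instance met the benchmark threshold after #{max_attempts} tries"
end
```

The cap on attempts matters: without it, a temporarily degraded availability zone could leave you provisioning and discarding instances forever.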
And then, specifically, do you know what it is that your app cares about? It is entirely possible that your application is completely CPU bound and you honestly don't even care if your co-tenants are doing tons of I/O, but you care a lot if your co-tenants are doing video encoding. Maybe it's a memcached server and you're just memory bound. Maybe it's Postgres and you're everything bound; Postgres just wants everything. But this is the kind of thing that actually matters, and knowing this difference can make a really big difference in how many servers you need and how much your servers cost. And as you get bigger and bigger, one third more performance for the same cost becomes a larger and larger number that is worth putting more and more effort into getting. So now that I've convinced you that you need to measure all of these things you weren't measuring before, let's talk about metrics. One good point about the Ruby community: they're pretty good at collecting metrics. That's great. New Relic makes it really easy. You gem install newrelic_rpm, and hooray, metrics. Metrics are really important. Tracking things, as we were just discussing, is really the only way to know what's happening. Without metrics, your production environment is kind of a black box, and you're like, oh, things aren't as good as they were before, and I don't know why, or probably even how, exactly, because I wasn't able to measure, didn't know how to measure, the things that matter. The first time the importance of metrics really hit home for me was in 2011 at GitHub's first CodeConf. I saw a talk by Coda Hale called Metrics, Metrics Everywhere, and the underlying point of his talk was that the reason all of us have jobs, and the reason all of us write software, is to deliver business value, whether that's to our bosses or to customers or to clients.
Most software exists for the purpose of delivering business value, especially if you're getting paid to write it. And if you can't measure it, you can't tell if that is what you are doing. Having said that, you're probably not super impressed by me telling you that metrics are important, right? You do need to know what's going on. But there's a catch. Once you have metrics, you have a tendency to become convinced that you now understand what is happening, and I don't blame you. I do this too. It's a human thing. You're like, oh, I'm measuring a thing, now I understand it. But just like being able to see the speedometer does not tell you how the car's transmission and engine work, being able to see a metric on your application does not tell you how and why it is working. It just tells you that something is very different than it was before, and now you need to figure out what it is and why it is different. A very common problem is that having metrics, having some visibility, makes people think they have total visibility, and that just isn't how things work, unfortunately. So at the end of this little bit about metrics, this is probably going to be you instead. I'm going to talk about some ways that metrics actively mislead you, and the biggest thing that causes this kind of metrics-driven misunderstanding is averages. When you have a lot of metric information, especially if you have a bunch of app servers, the easiest way to distill it down into something you can quickly communicate is to take the average. A super good example of this is New Relic's dashboard: when you first open it, it's like, here's a giant number, this is the average of all requests across all app servers. So you see those graphs, you see the numbers going up and down, and you're like, great, now I know what's happening with my app, right? Unfortunately, no. Brains are really highly developed, carefully tuned pattern matchers.
This is how humans can see Jesus in toast. This is how you can see an average and think, I know what that means. Your brain's immediate extrapolation from an average is probably what's called a normal distribution. There we go, normal distribution, right? You think, oh, the average is going to be right at the top of that. This is often called a bell curve, and it's what happens when all of the inputs into the graph are generated by a random function. Tell me if you think your app is a random function. I mean, maybe it feels like a random function. But your app is not actually a random function, and the practical upshot is that its distribution doesn't look like this at all. This is a more realistic graph of what might be producing an average that's right at the zero point on that graph. To drive home how wildly misleading averages can be, let's look at a bunch of real-life graphs at the same time. This is a whole bunch of different measured metrics from a real-life thing. It was a MySQL benchmark; it doesn't really matter what it is. Each line is a distribution: near the left are the results that were fast, and as you go to the right, the results that were slower and slower during the benchmark. The small black vertical lines, which you maybe can't see very well, represent the average for that particular line. To make it easier to see, I'm going to line up all of the averages on this same graph. Not a single one of these looks like a bell curve. Worse than that, most of them have zero actual data points at the average line. It's really characteristic to have a very large number of points either clustered together in the fast zone or spread out over the long tail of slow things. But if you look down near the bottom, some of these lines don't have even a single result that's near the average line.
And so if you're looking at New Relic, you might not have even a single request that takes the number of milliseconds you're seeing in giant font on your dashboard. This is the problem with averages: unless your metrics are being generated by a random function, the average is going to actively mislead anyone who sees it. There's a great quote about this from a tweet by a friend of mine, @sferik: the problem with averages is that, on average, everyone's app is awesome. The single good thing about averages is that they can tell you that something changed. You can say, oh, my average was this before, but my average is this now. That's weird. The problem with averages is that they can't tell you what changed or how it changed. It's actually possible to get that information out, and I'm going to show you how to do that. So, while averages can tip you off that something changed... here's a graph of an average. As you can see, things are taking somewhat less than 100 milliseconds, but that could mean there are tons of things happening that take about 100 milliseconds, or there could be tons of things happening that take 10 milliseconds and tons of things happening that take three seconds. It's an average, so there's literally no way to know. One way to get around this is to graph the median rather than the average. The median is the number that was bigger than half of the numbers and smaller than half of the numbers. The great thing about the median is that you are sure it actually happened. The average may or may not have ever actually happened, but the median definitely happened. And if we add the median to this graph, you can see on the purple line that we now actually know more than we did before. Half of the values are actually very, very fast, around the 10 millisecond range, maybe 20 milliseconds.
So even though the average jumped all the way up to 150 milliseconds at one point, at least half of the requests were still happening just as quickly. They didn't slow down. That tells us that since most of the requests didn't slow down, this wasn't an application-wide change. We didn't suddenly get a really slow load balancer. This wasn't a network switch problem where all of the traffic was impacted. The next thing you can do is graph other percentiles. The median is the 50th percentile: half is below, half is above. Start graphing the 95th percentile: 95% was below, 5% was above. Here you can see that the slowest 5% of requests got dramatically slower, more than 10 times slower than the median. And that's what dragged the average up. Often even better than the 95th percentile is the 99th percentile. This is one out of 100, so it's actually a pretty good indicator of what the occasional slow request looks like. Well, I had to rescale the graph. And the slowest 1% of requests are now clearly the entire reason why the average tripled. The median stayed exactly the same; the median is now a flat line. And that slowest 1% is probably some single specific controller action that you now need to go find, to figure out what exactly happened to that specific single thing. So just by graphing the percentiles rather than the average, we can immediately rule out about half of the possible problems that made our average slower. And it works the other way around too. If you look at the graph of the 99th percentile and it isn't dramatically different even though your average is higher, then you know not to look for a single controller action. You know to look for a systemic problem.
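A tiny worked example makes the difference vivid. This sketch uses the nearest-rank method (one of several ways to compute percentiles), so every percentile it returns is a value that actually occurred in the data:

```ruby
# Nearest-rank percentile: always returns a value that really happened.
def percentile(values, pct)
  sorted = values.sort
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

latencies = [10, 10, 11, 12, 13, 3000] # mostly fast, one pathological request

average = latencies.sum / latencies.length.to_f # ~509 ms: happened to nobody
median  = percentile(latencies, 50)             # 12 ms: what most users saw
p99     = percentile(latencies, 99)             # 3000 ms: the outlier itself
```

The average lands at roughly 509 milliseconds, a latency not one request actually had, while the median and 99th percentile together tell the real story: almost everyone was fast, and one request was terrible.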
Aggregate graphs. This is another really common thing, where aggregation is a fancy way to say: I got many versions of this metric from many servers, and then I averaged them. So here, again, is an average graph. This one happens to be taken from the actual Bundler API, and it is a graph of the trick I mentioned where you call sleep 1 and then see how long it took. The number we're tracking here is the extra milliseconds, and it went from taking one second plus two milliseconds to taking one second plus five milliseconds. Does that mean garbage collection pressure was more than twice as bad? We don't know; it's an average. You can improve on this with breakout graphs. If you are collecting a number from 25 machines, put 25 lines on your graph instead of one line that will mislead you about what all of the different machines are doing. Here is a breakout graph of the same data. Like I was mentioning before, with the breakout graph we can see: holy crap, this had to be rescaled. One of the machines started taking 35 extra milliseconds per second to sleep, but all of the other machines were basically fine. So we wound up resolving this issue by just killing the one dyno that was having trouble and restarting it as a fresh dyno. We didn't have to nuke all of our dynos. We were able to narrow down the problem immediately, just from having a breakout graph. So, do it. Visualize your data. Here is an example of why visualizing your data is so, so, so important. These are some different data sets. Each orange dot is a single entry in that data set. Can anyone guess what the blue line is? Yes, that's the average. It's actually even worse than that.
All four data sets have the same average of x, the same average of y, the same variance of x, the same variance of y, the same correlation between x and y, and the same linear regression. Actually graph your data and then look at it, because the averages and the variances and the correlations and the linear regressions don't contain any of the information about what is different in those graphs. One final note. A lot of people talk to me about how awful averages are, and then I ask, oh hey, so how do your alerts work? And a lot of people have alerts that are set up to only talk to them after the average is bad. As you can maybe guess, by the time the average is bad, it is too late. Definitely break out your alerts as well as your graphs. You want to know when the first server went down, not when the average of the servers is a down server. So, ultimately, I really just wanted to let you know that the network is a part of your application. Most people don't think about it, because they don't have to interact with it in their day-to-day development on their own local machine. And after you have deployed your application, it is really user experience that matters, not how many milliseconds your Ruby app spends running code. That's it? Sure. So the question was: if you don't alert on averages, how do you prevent continuously alerting, getting alert fatigue, and then not noticing that something actually bad happened? And the question included a note that there is no silver bullet for this. Unfortunately, the answer is: there is no silver bullet for this. The best plan I have ever seen, from the best operations people I have worked with, is to figure out what the baseline of your system is when it's functioning, and alert when your system is not that. That means figuring out how many requests you're successfully serving per minute and alerting when it deviates from that by more than 50%.
It means figuring out when it's normal for that to deviate, and not alerting on that. And it's actually a ton of work, because every single application has a completely different norm. Some Rails applications serve 50 requests a minute, and that runs their entire profitable business. Some Rails applications serve hundreds of thousands of requests a minute and they're not profitable yet. You need to figure out what your metrics look like when your company is functioning, both software-wise and company-wise, and alert when they're not the thing that gives you the indicator that things are okay. That's really the best advice I have for you. Five seconds. Any more questions? All right, I'm happy to talk about this stuff later.
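The baseline-deviation check from that last answer can be sketched in a few lines. The 50% threshold and the names here are illustrative, since every application's norm is different, and the alerting hookup is whatever paging system you actually use:

```ruby
# Sketch of baseline-deviation alerting: compare current throughput to a
# known-good baseline and flag it when the relative deviation exceeds the
# threshold. Baseline and threshold are illustrative, not prescriptive.
def throughput_anomalous?(baseline_rpm, current_rpm, threshold: 0.5)
  deviation = (current_rpm - baseline_rpm).abs / baseline_rpm.to_f
  deviation > threshold
end

# e.g. page_oncall! if throughput_anomalous?(baseline_rpm, current_rpm)
```

Note that it fires on deviation in either direction: a sudden doubling of traffic can be just as much of an incident as a sudden drop.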