Okay, well, welcome everyone. Thanks for coming. I know this is the end of a really long day, the last day of the conference. We're all a little tired, and we do have to save our energy for Aaron's keynote. It's always a lot of fun, so don't miss it. But I'm glad you're here. We do have some good things, important things, to talk about.

So to start off, who is this person up here? My name is Daniel. I've been a Ruby developer for about 12-ish years or so, and have used other languages for longer than that. I've spent most of that time in early-stage startups, really small companies. Currently, though, I work at a different kind of company. Yeah, my current company is such a small company, so it's a bit careful about its messaging. So some quick notes. I'm not speaking as their representative. This is my talk, and my views are my own, not my employer's. That said, my employer does own my code, and for whatever reason that actually includes examples on slides. So I should mention that the code samples in this talk are copyright Google and licensed under Apache 2.0. So that's that stuff out of the way.

I have to say that I was really impressed by David's opening keynote. I liked what he was saying about the importance of belief systems, those underlying values that should anchor us and anchor our decision-making, and the importance of recognizing what those values are and how they define us. So yeah, I really appreciated his message, and one thing that it made me think about was: what are the values that have rubbed off on me through all the various small companies, and not-so-small companies, that I've worked for? One of the common themes, I think, across all these organizations, big or small, especially the good ones, has been the importance of measurement, of data.

I remember my first Rails startup. It was back in the beginning of 2007 or so that we launched our first Rails site. So in 2007, we were on Rails 1.0.
So, good old days. I remember that a few months after we launched, our CEO was live on national TV for something totally unrelated to our company. He was on this interview, and he happened to accidentally drop the name of our site on national TV. It was totally unplanned. He didn't intend to do it, but it was on national TV, and the show was actually pretty popular at the time. So as you might guess, within a few minutes I got a page from our monitoring system, and yes, our site started to get really sluggish. I took a look at our logs, and we were seeing a traffic spike of multiple orders of magnitude. This is, of course, both a good thing and a not-so-good thing, right?

But it was 2007, and we didn't have the nice auto-scaling cloud services that we do now. We were on physical servers in a colo at the time. So I spent the next couple of hours logged into those servers, trying everything I could think of to increase our capacity. I was tweaking our load balancers. I was disabling features. I was spinning up more Mongrels per machine. We were running under Mongrel 1.0 at the time.
That was the state of the art at the time. Nothing really worked. We were just extremely laggy, and I guess eventually, after a few hours, traffic settled down a bit and the site recovered. But in the meantime, we were flailing around in the dark.

Later, when we did our post-mortem investigation, we determined, among other things, that our Mongrels were actually memory constrained. We had in-memory caches within each Mongrel process, and as traffic went up, these caches would get squeezed and we'd start thrashing. The site just started falling over. So my attempt to fix the problem by spinning up more Mongrels was just squeezing memory even more and making the problem even worse. The worst part of this was that I couldn't even tell that this was happening. We had a caching strategy. We thought it would work, but I didn't really have good data on its actual behavior in production. As a result, we weren't really able to respond well when we ran into a production issue.

So it's crucial to have real data on what your app is actually doing in production, when it's under production load. This is true whether you're just starting out, as we were, or whether you're one of the largest, most successful companies, with some of the largest products in the industry. You have to know what's going on. So at the small company that I work for now, we measure everything. And if you're just starting off, if you're just launching your first app, you also need to measure everything. So how do you do that?
Well, to begin with the obvious, there are a number of excellent services out there that do system monitoring and application monitoring. This screenshot here shows Stackdriver, which, full disclosure, is actually developed by my current small employer. But there are a number of other companies and other products that are very, very good. If you've spent some time down the hall at the exhibits, you've probably met some of them: performance monitoring services, error monitoring. Who here uses some kind of monitoring service in production? Okay, that's good. That seems like a majority.

Yeah, those services generally do a really good job of collecting data and providing visualization and analysis tools. That said, there will be times when you need to customize, when you need to measure something that a general-purpose tool won't give you out of the box. So this afternoon we're going to take a peek under the hood at some of the techniques that these monitoring services use to instrument your application. You'll see how you can use some of these techniques to perform your own monitoring, or to customize the existing monitoring from the services that you're using to fit your application's needs.

So here's what we'll cover. We'll learn techniques for instrumenting your app. We'll look at how to gather data without disrupting your running app's behavior in production. Then finally, we'll discuss what sort of things you should be measuring.

So let's talk about instrumentation. This is an instrument. It's an old electroretinograph machine, used to diagnose a variety of retinal problems. Modern versions of this instrument are a lot smaller.
Maybe a little bit less scary, too. But old or modern, these machines do have one thing in common. They have electrodes, and those electrodes need to be in direct contact with your eyes in order to measure things. So generally a patient is given anesthetic eye drops. A little bit scary, but necessary. Similarly, when you're collecting data from a running application, measurements need to be in direct contact with the code that's involved. That's the job of an instrumentation API: to collect data from a running application. It gives you the ability to inject measuring code at key points in your application, to put that measurement code in direct contact with the code that's being run.

Rails apps can use an instrumentation API called ActiveSupport::Notifications. To see how this API works, let's take a look at an example. Remember my imploding cache? How would you know if your cache is working the way that you expect? In my case, the caches were running out of space, and when that happens you'd probably see a lot of cache misses. Indeed, your cache hit rate is a good indicator of the health of your cache in general. So let's measure it, and we'll do this by using notifications to count the cache hits and the cache misses.

Here's what that would look like. Whenever Rails reads from a cache, it calls this method, ActiveSupport::Notifications.instrument. This call takes a measurement of the caching code. It notes that the cache was read, records how long that took, and records other information such as the cache key that was read and whether it resulted in a cache hit or a cache miss. It gives all that measurement data a name, in this case cache_read.active_support. So Rails is actually already doing all of this for us. We say it instruments its cache code.

So in your app, what you can do is subscribe to this measurement, and you do this by calling ActiveSupport::Notifications.subscribe. You give it the name of the measurement that you're interested in, and you give it a block. Whenever that measurement is taken, notifications will call your subscriber block and give you a chance to do something with that measurement data. Now I won't go into all the details of the API. That's something that you can look up on your own. In this example, all we're doing is taking whether the cache hit or missed, and logging it. So now we have a log that might look something like this. Afterwards, we can run simple tools like grep or word count on our log, analyze that data, and get useful statistics.

Now, Rails actually already instruments a number of things for you in addition to cache reads. So for many measurements, all you have to do is subscribe to them. But you can also call instrument yourself and instrument your own code. This is particularly important when you're writing your own Rails plugins. It's a good idea to instrument your plugin so that applications that use your code can measure its activity and its performance.

All right, so far we're receiving notifications of all our cache hits and cache misses. We're logging them, so we have an overall measurement of the cache hit rate. This is already interesting data, and as you can see, it's very easy to get, just a couple of lines of code. But to make it more useful, we sometimes need to collect a bit of context.

So let's take a look at another old medical instrument. This is a vintage X-ray machine, from about 1900 or so. At that time, X-ray machines were little more than a tube of radioactive material that the doctor positioned over the patient. Obviously, many of the early experimenters with X-ray imagery weren't really aware of
some of the hazards of radiation exposure. So there were some illnesses and some deaths, among patients as well as doctors and researchers, around this time. Nowadays, of course, when we X-ray, we're very careful and very specific about targeting exactly the part of the body that we need to measure. This is something that you also want to do when you're instrumenting your app.

Most apps have a number of controllers and actions. I was just talking to someone a few days ago whose company had a monorail with around 400 controllers. Most of us don't have apps that large, but still, you often want to focus your measurements down a little more than measuring the entire application: one particular controller, or maybe even one particular action. So what if I wanted to measure the cache hit rate for just one particular action that was interesting?

The first option that might come to mind is, okay, let's turn notifications on at the start of the action and turn them off at the end. The trouble is that notifications are global. They apply to all threads at once, and that will include threads that might be running other requests. A lot of us are probably on multi-threaded web servers at this point, so this won't work in those cases. So what do we do?

Here's a technique that you can use. Let's start with our existing cache subscriber. Right now it's logging on every cache read, regardless of which action is being executed. Now, we can determine the action by subscribing to a different measurement, in this case the start_processing event. This is a measurement that's taken by Action Controller at the start of processing a request. It captures various information, such as which controller and which action are going to be executed. So we can determine here whether to take a cache measurement. Now we need to communicate that information to our cache read block, and we need to do that on a per-thread basis.
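A rough sketch of that per-thread hand-off, in plain Ruby rather than Rails. The two lambdas are hypothetical stand-ins for the start-of-request and cache-read subscriber blocks, the `ProductsController#show` target and the `simulate_request` helper are invented for illustration, and `Thread.current[]` plays the role of the per-thread storage.

```ruby
# Plain-Ruby sketch: every name here is illustrative, not Rails API.
module CacheProbe
  LOG = Queue.new # thread-safe sink standing in for a real logger

  # Stand-in for the start-of-request subscriber: flag this thread only
  # when the request targets the action we care about.
  ON_START = lambda do |payload|
    Thread.current[:measure_cache] =
      payload[:controller] == "ProductsController" && payload[:action] == "show"
  end

  # Stand-in for the cache-read subscriber: log only if this thread is flagged.
  ON_CACHE_READ = lambda do |payload|
    next unless Thread.current[:measure_cache]
    LOG << (payload[:hit] ? "cache hit" : "cache miss")
  end

  # Simulates one request handled on its own thread.
  def self.simulate_request(controller, action, hit)
    Thread.new do
      begin
        ON_START.call({ controller: controller, action: action })
        ON_CACHE_READ.call({ hit: hit })
      ensure
        # Servers may reuse threads, so reset the flag after each request.
        Thread.current[:measure_cache] = nil
      end
    end
  end
end

threads = [
  CacheProbe.simulate_request("ProductsController", "show", true),
  CacheProbe.simulate_request("AccountsController", "index", false)
]
threads.each(&:join)
puts CacheProbe::LOG.size # counts only reads made by the flagged request
```

Both simulated requests read from the cache, but only the one that matched the target action leaves a log entry, even though the two run concurrently.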
So we can't use a global variable. For that purpose, Active Support provides per-thread module attributes. You might have seen module attributes like mattr_reader and mattr_accessor. Normal module attributes are basically just global variables that are attached to a module. But there's also this version that can have a different value per thread. It's actually just a convenience wrapper around thread-local variables, if you're familiar with those from Ruby. Using this pattern, we can now communicate between our subscriber blocks on a per-request basis.

Obviously, there are still some caveats. These are still globals, even if they're thread-scoped, so use them with care. For example, a number of web servers (I think Puma is one of them) will actually reuse threads across multiple requests. So make sure that this kind of data actually gets cleaned up or reset between requests. It's not a perfect solution, but it's good enough, I think, for this purpose. So now we have a technique for measuring cache hit data, and for doing so for a particular action.

Let's take it a step further. Here's another interesting-looking instrument. This one's from about 1960. It's hard to see here in this photo, but the subject is wearing contact lenses, and they have miniature lamps connected to them.
So it's actually able to capture eye movements and eye reflexes. At the same time, the whole contraption moves, and the motion causes visual illusions. So the device is actually gathering multiple different sources of data, combining eye measurements and machine motion, and it's using that to study some of the mechanics of visual perception.

Combining information from multiple sources, correlating that information, is something that you need to do quite often when you're instrumenting your application. For example, it's helpful to know your cache hit rate, but it would also be good to know how much a cache hit actually buys us. Is the request latency actually any different if your cache hits or misses?

So again, here's the code that we were just looking at. We're determining which action is running, then measuring whether there's a cache hit or a cache miss, and then we log that information. We can get the request latency in a number of different ways. In this case, we can subscribe to yet another event, the process_action event. This measurement is taken by Action Controller at the end of an action, so it can provide information about what happened with the action, such as the HTTP result or the latency. So in this example, we've added a subscriber to the process_action event, and we're logging the request latency. So we have two log lines now, one logging the cache hit or miss and one logging the latency. We can make this more useful.
We should combine these two pieces of data, cache hit or cache miss and latency, and log them together. Once again, we do that by using a per-thread attribute to communicate between subscribers. So now those two pieces of data are logged together, and when we run this in production, we might get log data that looks something like this. You can see there's a clear latency win when we get a cache hit, but it looks like our cache hit rate might be a little bit lower than we would hope. So now we have actual useful information that we can apply towards improving our cache behavior. Or, as my PM friends at my small company like to say, we have actionable data.

So we've gone through an example using notifications. There are several other instrumentation APIs that might be useful. One of the simplest ways to instrument an app is to use controller filters. Here's a simple around filter that just measures the latency of the action. This is the easiest way to gather really simple information about a request as a whole.

You can also write a Rack middleware. This is useful if you want to measure the behavior or latency of other middleware, because you can insert your middleware at any place in the middleware stack. You can also use this to install instrumentation code in other frameworks, such as Sinatra, Padrino, Hanami, any non-Rails frameworks that might not use Active Support.

Finally, there's TracePoint. I think of TracePoint as kind of the sledgehammer of all instrumentation APIs. It's a bit different in that it's not part of a web framework. It's actually part of the Ruby virtual machine. It works similarly: you provide code that gets run when certain events happen, in this case at the language level. So events like an exception was raised, or a method was called, or a method returned, or even execution moved to the next line in the source code. So it's a very powerful API.
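As a minimal illustration of the language-level hooks just described, this sketch uses Ruby's built-in `TracePoint` to count method calls while a block runs. The `count_calls` helper and the `fib` method are hypothetical names made up for the example; only `TracePoint.new`, `enable`, `disable`, and `method_id` come from Ruby itself.

```ruby
# Counts calls to Ruby-defined methods while the given block runs.
# Note: enabled this way the trace is global, so calls made on other
# threads during the block would be counted too.
def count_calls
  counts = Hash.new(0)
  trace = TracePoint.new(:call) { |tp| counts[tp.method_id] += 1 }
  trace.enable
  begin
    yield
  ensure
    trace.disable # always switch the hook off, even if the block raises
  end
  counts
end

# Hypothetical workload: a deliberately naive recursive method.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

counts = count_calls { fib(5) }
puts counts[:fib] # => 15 (fib(5) makes 15 calls to fib in total)
```

Even this tiny workload fires the hook fifteen times, which hints at why the hook's cost matters once it is attached to real traffic.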
It's a bit specialized, probably more commonly used for debuggers than for monitoring. At my small company, we actually use it pretty extensively to build a cloud-based debugger product, because it lets us instrument down at the source code level. But it's probably not something you'd use that often for monitoring. It is extremely powerful, however, and that brings up an important issue.

So here's another image. This device is from 1940, and it measures brain waves. I think in this case it was measuring brain waves for someone who had gone through some PTSD due to wartime experiences. But you know, we look at images like this, and some of them are a little bit scary, right? We wonder if these machines are safe. We wonder, is it going to observe my brain, or is it going to reprogram my brain? This is an important question when instrumenting in production. It's critical that we can take measurements in production against real traffic, but it's also critical that we don't change the behavior of our app in the process. We don't want to reprogram our app because we're taking measurements.

Safety is incredibly important when instrumenting, so we're going to talk about that a little bit. One major component of safety is keeping the latency effect to a minimum. So here are some tips for going about that. First of all, as we saw, isolate and spotlight the interesting use cases. Oftentimes you don't need data from the entire app, but maybe just a few particular controllers of interest or a few particular actions. So isolate those.

I encourage you to experiment with instrumenting new things, gathering new pieces of data. However, always circle back and reevaluate. If it turns out that some measurement is not really giving you interesting information after all, don't be afraid to delete it. I tend to treat instrumentation like tests. Many of them should live on indefinitely because they're monitoring critical systems. But there are some that
really are only useful temporarily, maybe because they're part of a one-time investigation that you did, or maybe because you put them in and it turns out they're not actually as useful as you thought they would be. If you leave them around too long, they'll just slow you down, just like your tests might take longer than you really want. So practice making those judgment calls. Don't be afraid to spin up new instrumentation, but don't be afraid to delete it as well.

Sample your data if you can. Often you can get away with measuring only one in a hundred instances, or one in a thousand.

TracePoint can be particularly dangerous, because many of its events can fire extremely often. Remember "moved to the next line in the source code"? So use it only when you have no other choice. If you do need to use TracePoint, here's a pro tip. It's global by default. Just like we saw with ActiveSupport::Notifications, it applies to all threads at once. However, there is an alternative TracePoint API that lets you instrument just a single thread at a time. The catch is that, as of the last time I checked, it's only available as a C API. You can't call it directly from Ruby. So it's harder to use, but it is available, and it's worth investigating if you want to use TracePoint temporarily for specific requests.

Finally, of course, pay attention to how you're reporting your measurements. If you're just logging to the file system, that's usually pretty fast. But as your app grows, you might want to start sending data to a remote analytics service.
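Going back to the sampling tip for a moment, it might look something like this in plain Ruby. `SampledRecorder`, its `rate:` and `random:` parameters, and the in-memory `records` array are all hypothetical, chosen only to keep the sketch self-contained; a real app would hand sampled measurements to its logger or monitoring client instead.

```ruby
# Keeps roughly `rate` of the measurements it is offered and drops the rest.
class SampledRecorder
  attr_reader :records

  # rate:   fraction of measurements to keep (0.01 keeps about 1 in 100)
  # random: injectable random source, handy for repeatable runs
  def initialize(rate:, random: Random.new)
    @rate = rate
    @random = random
    @records = []
  end

  # Returns true if the measurement was kept, false if it was skipped.
  def record(name, payload = {})
    return false unless @random.rand < @rate
    @records << [name, payload]
    true
  end
end

recorder = SampledRecorder.new(rate: 0.01, random: Random.new(42))
10_000.times { |i| recorder.record("cache_read", { request: i }) }
puts recorder.records.size # about 1 in 100 of the 10,000 offered
```

The sampling decision is made before any other work happens, so a skipped measurement costs little more than one random number.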
Or you might want to start sending data to the application monitoring service that you're using. When you do that, make sure you don't block on those calls. Usually the API gems for the monitoring services will have a non-blocking version of the client that you can use. If you have to use straight HTTP calls, be careful with Net::HTTP.post, because it is blocking. It waits for the response to come back. You don't want to do that. So use an asynchronous client if you can. Spin up a background thread. Send your data in batches if that's allowed by your API. In general, just be very careful about how you're reporting your measurements.

Another element of safety is avoiding side effects, meaning changes to your app's behavior. On one hand, it's kind of obvious: don't modify your application state in your subscribers. But side effects can take many forms. Take a database query: in addition to potentially adding latency, you should also consider it a side effect. It might not nominally change the state of your application, but it does change its behavior, because you're invoking parts of your system that you otherwise wouldn't. So be careful about those sorts of things, in particular when calling methods on Active Record models. Some of them might initiate additional database calls, so you want to be careful about that.

So we've talked a lot about how to measure. We have a few more minutes, so let's spend those on what to measure.

This is an early ECG machine, an electrocardiograph machine. I think some people say EKG. This is from 1901. It was actually the first machine that was sensitive enough and practical for medical use. I think scientists had experimented with similar technology as early as the late 1700s, but it wasn't until around 1901 that it became practical. Again, it's measuring electrical activity. The electrodes here are actually in these metal basins next to the patient.
So you would have one foot and both hands immersed in salt solution in these basins. ECGs are, of course, still widely used today, not just to diagnose heart conditions, but also to monitor patients who are critically ill or undergoing general anesthesia. This is, of course, because the heart is an indicator, right? When something is going wrong, it often shows up in the heart's behavior. So what are the indicators for your application? What are the things that can tell us when things are not healthy, or when something unexpected is happening, or when something has gone critically wrong?

So here are some of the things that you can measure. First, results: the responses that you're sending back from your requests. Make sure they're in line with what you expect. One indicator that's sometimes really important is how big your responses are. Are those sizes what you expect? Also pay special attention to error responses. Obviously, if you're returning 500s, then there are problems going on. But don't ignore your 400-level errors either. When I was younger and less experienced, I tended to make the mistake of ignoring my 404 error rates, because I was thinking, oh, it's just a client error, it's not a server error, so it's not my problem. You might not necessarily want to get paged if you see a 404 spike, but you do want to know that it's going on, because it could mean that there's a broken link someplace, or it could mean someone's trying to hack your site. It's an indicator that something is going on. You should at least know about it.

Your final responses are important indicators, but not the only ones. Sometimes you have intermediate results that can be useful. Another thing is to pay attention to rendering. Rails templates are really not the fastest thing in the world. I've come to realize that over the years of using them. Rendering can have a significant impact on your app's latency. So measure it. Not too long ago,
I was actually working on an API that occasionally, just occasionally, ran really slowly. As we dug into this, it turned out that it was actually the JSON serialization that was taking up the bulk of the latency: some weird interaction between what my data looked like and some issues with my JSON library. If something like that is going on, would you be able to tell?

Of course, measure your interfaces with external systems. That means your database, external APIs, internal APIs, micro... sorry, M-word... services APIs, your caches, the file system. Now, out-of-the-box monitoring products will capture many of these things for you, obviously, but not everything. They typically focus on the performance of these external dependencies, like how long your database queries tend to take. But what's often missing is your usage. Is your application using the external system in the way that you expect? Are you hitting your cache as often as you think you should be?

Finally, errors and exceptions are important. Don't throw any errors away. That includes expected errors. It includes errors that you handle internally in your system and don't bubble up to your users. They can still be indicators. An important example that often gets overlooked is retries. Often when you call an external API you implement retry logic, right? Because the network can be flaky. Various things can be flaky. But if you retry, don't throw that information away. Instrument your retry code. Make sure that information shows up on your monitoring dashboard. Your retry rate is an indicator. If you get a retry spike, something's going on, and you want to know about it.

So we've covered a lot here, and really we've just scratched the surface of the number of things you can measure. But I hope I've communicated the importance of measuring. Again, it's something that at my small company right now we just do all the time, out of habit. Getting started is really simple. Subscribe to an Active Support notification, as we saw. With just a few lines of code, measure something and log it. Or, many of the commercial products that are out there have free trials. Go check one out. It's free. However you do it, start measuring. If nothing else, the data will be interesting to look at. It could also save you a big headache in the future.

So that's all I have. Thank you.