So, my name is Jason Clark, as introduced. I work for New Relic, and specifically I've worked on the Ruby agent. That's the gem portion of New Relic that you might have installed in some of your applications. It's not gonna play too deeply into what we're gonna talk about here today, but it did give me the springboard for some of the things that I've learned. That URL there provides some examples of what I'm gonna be talking about today. And I'm actually really glad that Laura gave her presentation about Docker, because all of the examples that I have are set up as Dockerfiles that you can run with Docker. That's great, because what we're gonna talk about is a set of tools that you can use to look at Ruby's behavior kind of from the outside. Specifically, we're gonna look at strace, rbtrace, and tcpdump. These are tools that you may have heard of in some context but may not have had the chance to dig into more deeply. They take a little bit of setup and configuration, and so those Dockerfiles in that repo will let you go offline, check those out, run them, and play with the tools yourself. I would highly encourage everyone to start downloading all of those images off of the conference wifi right now. That would be a really great idea. It's probably not a good idea to go downloading and installing all of those at once, but you can take that away later and play with these.

So this talk is set up as a series of mysteries, mystery stories for us to investigate. And like any good story, each of these stories has a moral. The overarching moral that I want you to take away from this talk is to dig deeper. There are always layers in your software below the layer where you work day to day, and you can dig in and find out what those layers are doing. That's what these tools are gonna help us do. So let's get started.

The first case that we're gonna look at, I call the case of the missing config file. This case comes out of a startup. It was dealing with a really, very secret data stream. I can't even really talk to you about what it was or what was happening, but it was very important that we be able to process it really quickly. And so there was a member of the team, a guy by the name of Carl, a venerable old C hacker, who wrote up a super-fast library that was able to parse this data stream for us and do the processing. Now, Carl was a nice guy, and so he wrapped it up in some Ruby. We had a gem that wrapped this native library he had written to make it a little more palatable. And there was one server out there that did this processing, just a kind of custom, hand-built thing. And of course, as always happens when you have one server, one thing that's responsible for something critical in your infrastructure, well, it imploded. The unfortunate thing is we had copies of the library, but when we went to install it and run it on a different machine, put it on a different server, we got this. The first part is just kind of logging, and then a message from Carl that that's not how that's supposed to be configured. We remembered that there was some sort of config file, some sort of setup that was on that box, but none of us knew what it was. None of us knew where these things belonged. And of course, when we went looking for Carl, or looking for Carl's source code, to figure out what this library was doing under the covers, we came up rather empty-handed.
So there's a lot of ways that you could approach this sort of problem: pulling your hair out, crying, screaming, all very appropriate reactions to this situation. But somebody suggested that we try to use strace. So we spin that up and go execute our data streamer program. You can start strace and hand it a command, and it will go run that. And it looked something like this. Yeah, okay. This is really, really helping us solve the problem a lot. If you've ever run strace or some of these other tools that we're gonna talk about, they can generate a lot of output. But what I want us to do is take a step back, take a breath, look closer at what this output is, and try to understand how it can help us solve our problem.

So we're gonna start with one line out of this. We're gonna take this line out of the middle, break apart what it is, understand how it's working, and then see how that can lead us to a solution to our problem. So, formatted a little for size: it says open, it has some parameters, kind of looks like a function call, then an equals sign and something on the other side. The first thing is what we refer to as a syscall, and we'll talk a little bit more about that. open is the name of the call that's being made. The parameters that are handed to it are shown next. So this looks like a file path, and then some sort of bit flag with some descriptors about how we're trying to open that file. And actually, the more that we look at one of these in isolation, the more it makes sense, right? When it's just a whole screen full of output streaming past, it's easy to get overwhelmed, but each little bit of it makes a little bit more sense on its own. In this case, open has a return value, and that return value is the number of the file descriptor within that process that was opened. Here, that turned out to be file descriptor three when this process ran.

So what are these syscalls? What is actually going on here? In your computer, there are two sort of major divisions between code that runs. There's the kernel, which has kind of direct access to your hardware. It's the piece that is able to talk to your memory and your disk; it can do everything that it wants to do, and it just has all those resources available. The kernel is code that, you know, some very smart people write. It's mostly written in C and assembly; it's very low level, very tightly controlled, very carefully written. And the other side of the world is user land. That's where your services, where your applications, run. And, you know, the kernel feels a little bit like this about your user code. It doesn't really want you to be doing whatever you want with the system. It doesn't want you to be able to hurt other programs by interactions that you might have. And so the interaction between you and the kernel is carefully mediated by a series of what we call syscalls. With a syscall, from user land we make a call to the operating system and ask it to do something on our behalf; there's a very well-defined interface for what it does, and then it takes that action for us. There are a lot of things that are syscalls: a lot of process management functionality, a lot of file reading and writing, network writing. All of those are mediated by syscalls at the kernel level for us to be able to get at.
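Just to make that concrete, here's roughly the shape of the invocation, as a sketch rather than the exact command from the slides. The script name data_streamer.rb is just a hypothetical stand-in for whatever program you're running; -f and -o are standard strace flags for following child processes and logging to a file.

    # run the program under strace, following any child processes (-f)
    # and writing every syscall to a log file (-o) instead of the terminal
    strace -f -o trace.log ruby data_streamer.rb

    # then page through the log at your leisure
    less trace.log

Logging to a file rather than letting everything stream past on the terminal makes the next step, searching for landmarks, a lot easier.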
You can find documentation on each of these syscalls that you see in the strace output with the man command, giving it section two, which is where those functions are defined. So if we say man 2 open, we can see that open opens and possibly creates a file; it shows us the C headers for what those functions look like and gives us a bunch of information. So we can dig into what this syscall is that we're seeing there. strace is a tool on Linux systems that allows us to spy on those syscalls as they're happening while the program's running. So that's what all of that output was: every call that was coming from our Ruby application and ending up resulting in a syscall to the underlying OS.

So how does this help us? This gives us all of this output, all these things. I think the important thing to do with it is to look for landmarks, the things that we can recognize in the output, so that we can find the locations in it that let us solve our problem. Specifically, there are a couple of key outputs that we get here. We see our loading messages, and that writing is something the operating system is doing on our behalf, so those writes are gonna show up in our output. And then at some point after that, we get our message from Carl, and that's also written. So if we can look in our output and find the two locations where that writing is happening, we can look at the things that happened between those points in time and maybe get a hint as to where things went wrong.

So we dig into our strace output, use some sort of text tool to search through it, and eventually we land on these lines. We see a write, so that's the write syscall. It's going to one, and that file descriptor number happens to be the descriptor for standard output; that's why it's printing to the terminal for us. And then we see the data that we're actually writing out, those messages that were being written. So we can be pretty confident that our problem in this output is after this point in time. Doing the same trick from the bottom, we look up until we find Carl's nice message: that's not how I would configure that. And so now we've got our two points that we can bisect between, and it greatly reduces the surface area that we need to search through to figure out what's going wrong.

And in fact, looking between those two points, we ended up finding this line: open("/usr/local/slurp.confee"), and we can see that it returned with a value that says no such file or directory. So strace is telling us that the application tried to open this location and didn't find it. That seems simple enough. Carl certainly could have made the error message a little kinder, and frankly, that's a really odd name to be giving a configuration file. Don't name your config files like this. But strace has given us the ability to watch what's going on at this level and find something that we wouldn't have been able to find otherwise. strace is useful for a whole lot of other things, and there are tons of resources around it. It's great for watching process invocation, writes to any sort of file descriptor, any sort of case where you need to watch what your interactions with the operating system are doing. A couple of great resources:
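If you're playing along with the repo, the searching we did looks roughly like the following. The file names are the same hypothetical stand-ins as before; the -e trace= filter and grep are real options, and on newer Linux systems the open calls may show up as openat instead.

    # narrow the trace to just the syscalls we care about while recording
    strace -f -e trace=open,write -o trace.log ruby data_streamer.rb

    # find the landmark writes to standard output (file descriptor 1)
    grep -n 'write(1,' trace.log

    # then look for opens that failed between those two landmarks
    grep -n 'ENOENT' trace.log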
Julia Evans has been doing a lot of blogging and even has a zine out there about strace that will show you a bunch more flags and a lot more things that it can do. And James Golick also had a wonderful presentation at GoRuCo a couple of years back about using this to debug things in a language that he didn't even know, and how he solved a real problem for one of his friends, live on the server, when no one else could. So strace is a powerful tool, and you should definitely take a look at it.

But strace doesn't solve everything. It looks at the syscalls that are happening; it can tell us about the operating system interactions, but sometimes we need other things. So we bring ourselves to the case of the unused method, or was it? In this particular case, there was an application, and this application allowed our external users to write scripts that interact with our system. So it's kind of a little mini DSL that we give them access to and that they can run. And in particular, there was a bit of code in there that was really becoming problematic for us to maintain. But these scripts are dynamic, people are passing them in, and so there was no way for us to easily see whether things were being used or not. There was a theory running around that this was kind of a deserted part of the system, that nobody was really using these objects, and that we should just get rid of them since they were causing us maintenance pain. But the business and product side didn't want us to get rid of them without some proof that they were not being used.

Now, there are a lot of ways that you could skin this, ways you could deal with it or work around it and find out what's going on. I work for New Relic, so: you could put a method tracer on the methods that you care about and watch to see whether those ever record anything. You could do some sort of logging. But unfortunately, we were under the constraint that shipping new versions of our software was actually really hard. You'd have to wait a long time and go through approval processes; it was very tough for us to do that. And this brings us to another moral, and that is to make proper preparations for the things that you want to do later on.

In this case, we were fortunate that somebody had installed a tool for us, a Ruby gem actually, called rbtrace. rbtrace allows you to get visibility into method calls as they're being made in the Ruby VM. Much like our strace diagram that we saw earlier, when somebody makes a method call in Ruby, either on our own objects or on core objects, rbtrace is able to spy on that and tell us what's happening. In the simplest case, you can give rbtrace a list of methods and have it trace a script as it runs, and it will give you output every time one of those methods is called. There are a variety of other flags: you can have it time how long calls take, apply timestamps, and do some other formatting as well if you need. But in its most basic form, it can just tell us when these methods are happening. So this seems like something really powerful that we could use for detecting whether this code that we suspect is unused is actually unused in practice. Now, starting up a whole new application from the command line and altering how we're calling it to put rbtrace on the front was not really palatable to our operations folks. But thankfully rbtrace, much like strace, actually allows you to attach to a process after it's already running.
Now, in your application, you have to have required rbtrace; that's the one code tweak that's actually necessary for this particular technique to work. But once you've done that, you can say rbtrace -p, give it the PID, and then the other parameters that you pass to it are set up the same way as in the other invocations that we showed. So with this, we were able to set up this trace, run it against the process, and just log that output to a file. We watched for a long time, for a period of time that we had arranged with our product people, and we found that there was no output in the file. Nobody during that period of time had actually called these Ruby methods that we cared about, and so we were able to make the changes that we wanted. This sort of method-level tracing is really useful, and rbtrace also provides some built-in sets of methods. There are tracers that you can apply for IO methods and ActiveRecord methods, and you can even make your own arbitrary lists of methods to pass in for it to trace. So this is really powerful if you need to find out what Ruby code is actually being called dynamically at runtime.

But that's not the only thing that rbtrace can help us with. There are some other things it gives us visibility into, like in the case of the occasional lag. We had a system, and this system, on a 100 millisecond timer, would make a call out, kind of a heartbeat ping, across the network to another service. And we noticed that we would consistently get the number of heartbeats that we were supposed to, but every once in a while there was a lag. We would see one of the heartbeats get delayed by a certain amount and then pick back up, and things were very flat the rest of the time. Now, this was not super alarming, but we kind of wanted to understand why this wasn't happening as consistently as we expected, because it was so rock solid the rest of the time. Why, at this particular moment, would we end up getting this sort of hang? And the key insight that we eventually came to was that there were some background tasks that would run in this process, and they would happen at about the 10 minute mark here. And right after we would see these background tasks happen, we would get the bump in the heartbeat. Now, this was a good hint because of the timing, but it didn't make a lot of sense, right? The background tasks that we had were running off on a totally separate thread. They were completely independent of the heartbeat thread that was supposed to be running and doing this operation here. And neither the heartbeat nor the background work really had anything to do with each other, apart from coexisting in the same process. So why was this slow? And how could these tools tell us why that was?

rbtrace comes to the rescue here with a flag it has for tracing garbage collection. So you pass --gc, and here I'm also using the -t flag to apply timestamps to the output, which lets us see a little better. We had some output that was telling us when we were running our background job, and we could see that immediately after that background job ran, we would get multiple garbage collections occurring. And at the steady state of our application, the rest of the time, garbage collections were far fewer and further between, and did not happen in these sorts of clusters. Now, if you dig into Ruby's garbage collection, when a major GC happens, the VM is paused.
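To give a flavor of what that looked like, here's a rough sketch of the two invocations we've just talked about. The PID, method names, and log file names are made-up placeholders, and you should check rbtrace --help on your version for the exact flag spellings; -p, -m, --gc, and the timestamp flag are the ones the talk relies on.

    # the process being watched must already have done: require 'rbtrace'

    # attach to a running process and report whenever the suspect methods fire
    rbtrace -p 1234 -m 'LegacyWidget#render' 'LegacyWidget.lookup' > unused_methods.log 2>&1

    # same attach trick, but watching garbage collection instead of methods,
    # with timestamps so the GC clusters line up against the heartbeat log
    rbtrace -p 1234 --gc -t > gc_activity.log 2>&1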
There is a portion of that pause that, although they've done a lot of work in recent versions to make it smaller, is still a point in time when everything stops for these GCs. And so if you generate a large enough pile of objects, everything's gonna have to stop at some point while those get cleaned up. Once we found this out, we were able to look at some other independent sources. New Relic, for instance, has a graph that shows object allocations and a variety of other Ruby VM things, and we could see that on the 10 minute mark, sure enough, there was a very large spike in our average object allocations. And that corroborated the theory that this was the thing that was causing our heartbeats to stagger. So taking that information, we were able to clean up, allocate fewer objects during those background tasks, and get things back onto a steadier footing. But rbtrace was able to give us the hint of where that problem was and show us the garbage collections that were happening silently underneath us. If you're interested in rbtrace, probably the best place to find out about it is its GitHub repo. Aman Gupta from GitHub is the primary author of it. And rbtrace is positioned well as something that you can use in production. It's designed to be low overhead and to be enabled only when it's actually doing some tracing. So definitely take a look at it. Justin Weiss had an excellent blog post that goes through a lot of the basic setup that we've covered here today as well.

This brings us to our final case, the case of the delicate negotiations. In this particular case, I was working on an application; it's actually the Ruby agent that I referred to earlier. What it does is gather up a bunch of data across a minute, and then every 60 seconds it transmits that data to a service across the internet. It talks to our collector, we called it. And what we were doing at that point in time, naively, was packaging all that data up and making a new Net::HTTP request. This would be a completely fresh request from scratch. It looks a little bit like this, with a different URL, but basically every single time we would make a full new connection to New Relic, we would go send the data, and when we were finished, we would shut that down. And then when the next minute came along, we would do all of that again. And there's actually some cost that comes with this, because these are connections over HTTPS. So there's SSL negotiation that has to go on. There are additional packets that go across the network to communicate the certs back and forth. There's computation on both the client and the server that has to happen for this. And so each of these new connections that we start up has a cost associated with it. And when you're running something at the scale that New Relic does, saving that sort of connection time on the server is a big positive. And it's a small positive for the client as well to be able to hold on to those connections for a longer time.

In sort of rough Ruby code, connection reuse looks a little bit more like this. We set up a connection, configure it properly, and then we tell that connection to start, and that will go make the connection out to the service that you're talking to. Then, subsequently, we can make multiple requests across that connection. Those don't have to reestablish and do that initial work; they can just send the data across the socket that we're intending to.
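Since that slide isn't reproduced here, this is a minimal sketch of the shape of it, assuming a plain Net::HTTP client and a hypothetical collector.example.com host; it is not the agent's actual code.

    require 'net/http'
    require 'uri'

    uri = URI('https://collector.example.com/agent_data')

    # set up and configure the connection once...
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true

    # ...start it, which performs the TCP connect and the SSL handshake...
    http.start

    # ...then reuse the open connection for each minute's payload
    3.times do |i|
      request = Net::HTTP::Post.new(uri.path)
      request.body = "payload #{i}"
      response = http.request(request)
      puts response.code
    end

    # close it when we're truly done
    http.finish

If you prefer the block form, Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| ... } does the same connect-once dance and closes the connection for you at the end of the block.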
So we plug this code in to reuse these connections, and the question comes up of how we should test it. We do the functional testing, we see that the data is still being transmitted, but in our staging environment it's really hard to pin down any one request and whether we're reusing the connection or not. And anything that we would do to log stuff locally could be lying to us, right? When you put in logging code for something, if you think that it's doing the thing that it's supposed to do, your logging is gonna tell you what you already think. So what we needed was a way to look at the layer below us and understand what this was really, actually doing. To that end, we applied tcpdump. Now, there are other tools, like Wireshark, that play in this same space or that can sometimes be better replacements, but tcpdump is a really simple tool that's available on Linux and other Unix systems for spying on the network packets being sent back and forth from your machine. In our case, it looks at the communication that happens from the Ruby level over to our network stack. And so the moral here, and this is a moral that I've kind of been beating on with all of these, is to find the boundaries and realize that you can spy on them. Realize that when there is a system that you hand off to, there is probably a tool that lets you see that traffic. And then, interestingly, as you learn more about the systems that are underneath you, you'll realize that there are layers within layers here. We talked previously about our kernel and the syscalls going back and forth from our code. In fact, when we make our HTTP call, that's a write to a socket, and the kernel itself is actually making the network calls that we're tracing with tcpdump. So there are layers everywhere in your system, and you can find out what those layers are doing.

Let's take a look at what it looks like when we actually run tcpdump. As opposed to the other tools, where we would attach to a particular process or run a particular script, tcpdump runs a little more ambiently. You run it on the machine and it spies on all of the traffic that's going on. Actually, it's fun to spin it up on your Mac or your laptop and see how much stuff is flying around that you had no idea your computer was doing. It's pretty baffling how much output that can give. But it looks a little like this, with some new lines inserted to help. Let's break down one of those lines, because this, again, will give you just a huge stream of output; we'll take the first one and dig into it deeper. The first item in it is a timestamp. This tells you when the thing happened; it can be useful if you're trying to correlate things in logging over a period of time. The second piece tells us the addresses involved in this packet communication. In particular, throughout the rest of the slides that I'll show, we're abbreviating those addresses. The one on the left was me, my particular address here, and the other side was Google, since I was making these requests to Google, and the angle bracket there tells us which direction that communication is going. So this was us sending a packet to Google. There are a variety of flags that get applied to these packets, and in this case this one has a flag of S. S is the SYN flag, so this is part of us starting a communication with Google and setting up that socket connection.
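For reference, the sort of invocation behind that output looks roughly like this. The interface name and the Google host filter are assumptions for the example, and on most systems you'll need sudo.

    # watch HTTPS traffic between this machine and a particular host;
    # -n skips DNS lookups so the addresses stay readable,
    # -i picks the network interface (yours may be en0, wlan0, etc.)
    sudo tcpdump -n -i eth0 'host www.google.com and port 443'

    # or capture everything to a file for later inspection in Wireshark
    sudo tcpdump -n -i eth0 -w capture.pcap 'port 443'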
Later on we'll see dots and Ps as data is sent back and forth, and there's an F flag that gets sent when we're done with the connection and we're closing it. Finally, there's a bunch of other information. This may be useful for different cases depending on what you're looking at, and you can actually pass verbosity flags to tcpdump that will dump even more information, or even output the entire contents of the packet for you to spy on. There are tools that will let you do that, but for our purposes here, what we care about is just inspecting the communication so that we can confirm that our connection reuse is working.

So as a reminder, this is what the naive form of the code looks like, where we establish a connection every single time. Let's take a look at what that looks like when we actually run tcpdump on it. We get a big wall of output, and it starts off with the SYN flags going back and forth. That's us talking to Google saying, I wanna talk to you, and Google saying back, yep, sure, that's great, we have a connection established now, let's go. Then there's a variety of packets that get sent back and forth; I've excluded about 24 of them in this particular case, for a very basic request. Part of those packets is the actual communication of Google's SSL certificate, and then us responding back that we see it and making sure that this connection is secure for us to talk across. And part of it is the actual packets of the request that we're making to Google and its response back to us. So there's a big pile of data that gets sent here, and then once we're done, because Net::HTTP closes out this connection in the naive form, we have a finish flag. We say we're done, Google confirms that we're done, the connection's closed, and we're finished.

So how does this look different if we share the connection and call that same thing more than once? Well, the first part doesn't really change. The first time that we make a connection across that shared object, we're still gonna see our SYN flags, and we're still gonna see the same amount of data sent in that first pass, but the big difference is we're not gonna see the finish on the connection here. And so this tells us that we kept that connection open to Google, and we can potentially send more data across it as long as both sides are still happy. There are timeouts that come into play; eventually the client or the server can decide that they're gonna close the connection and we'll have to re-establish. But as long as we're within a window where we're both still happy that it's open, we can send more data. And in fact, when we make that second request, this is all of the output that ends up getting shown for those packets. You remember there were almost 24 packets excluded before, plenty of stuff scrolling off the screen; this all fits on one screen now, for the whole request that we're making, because we're able to share that connection. This gets us down to just about 15 packets for that basic request, and we don't have to do all of that negotiation. We don't pay on the client or the server for doing it. And this gives me a warm fuzzy feeling. This lets me feel confident that the code I've written is actually behaving, at the lower level, the way that I expect it should.

There are some resources out there for tcpdump.
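One trick that can make this comparison even easier to eyeball, and this is a suggestion rather than what was on the slides, is asking tcpdump for only the connection setup and teardown packets, so every SYN and FIN stands out on its own; this filter expression comes straight from the tcpdump manual's examples.

    # show only SYN and FIN packets on port 443, i.e. connections being
    # opened and closed; with reuse working you should see far fewer of these
    sudo tcpdump -n 'tcp port 443 and (tcp[tcpflags] & (tcp-syn|tcp-fin) != 0)'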
Depending on the exact operating system that you're running, there may be other flags or other things that you need to pass to get it configured right. And like I said, if you run this directly on your local laptop, you will have a lot of noise from other things going on to work through. Benjamin Cane's quick start reference was very useful for me in laying out some of the most common options. And a former coworker of mine, Jonan Scheffler, had a talk at PDX.rb about the life cycle of a web request. So if you're interested in learning more about what those packets actually are that we watched streaming back and forth between us and Google as we started up, that will fill you in on all the gory details.

Like any good story, this one needs to come to a close. I hope that you found it interesting learning about these tools: strace, rbtrace, and tcpdump. These are really just the tip of the iceberg. There are so many tools out there that you can use to dig into the lower levels of your system and understand the behavior of what's going on. Always dig deeper. Don't stop at Ruby. Don't think, oh, I develop at this level and I can't understand what's going on below me. You can always get to a deeper level. It's code that other people wrote, and if other people wrote it, you can understand it. When you do dig deeper, look for landmarks. Don't panic when you get a wall of output that you don't understand; look for the things that you do recognize and start there. Be prepared. These sorts of tools often require some extra privileges on systems, or may need to be installed, so think ahead about the sorts of things you would want to watch on a production system before you need them in the middle of the night when things are broken and not working properly. And lastly, look for the boundaries. Figure out where your code crosses over to another system, and I bet you'll find a tool that can spy on it. I hope that this has been useful to you, and I hope that some of you will take a peek at that repo, take these cases, run them yourselves, and play with the tools. There's also a bonus case in the repo that I didn't talk about here for you to explore. Thank you.