computers work exactly the way I expect. But unfortunately, as we heard this morning from Ramsey, and as we all know, that's not always what happens, right? They don't always cooperate. What I'd really like is to at least be able to figure out why the computer and my system aren't aligned with my expectations. So recently I've been learning a lot about DTrace, and it's been helping me figure out why things aren't going the way I expect. I'm going to show you an example of a problem that I ran into and how DTrace could potentially help solve it, but first let's talk about what it is.

DTrace is a tracing system from the Solaris/illumos world. It's also on FreeBSD, and on OS X of course, which is what I'm using, which is awesome because now I can use it. It lets you write programs to trace events that are going on in your system, and then filter, aggregate, or take other actions based on those events. The programming language you use is pretty limited, but that's by design; it's a good thing. Your programs are running in kernel space, which is insane, but DTrace is meant to be safe to run on production systems. And "safe" not in the NASA sense, but more in the web-scale sense of safe, right? So for instance, there are no loops in the language. The language is not Turing-complete, which breaks my brain, trying to use a programming language that doesn't have loops. But that way DTrace has an easy time making sure that your program is actually going to halt; the halting problem applies to Turing-complete languages.

So what kinds of things can you do with DTrace? This is what the DTrace language looks like: it looks sort of like awk. I believe this is at least the third mention of awk today, which blows my mind. It's awesome. There have probably been more. A little bit like awk, a little bit like C. This program traces two kinds of events: system calls, and the DTrace session itself ending. What we're doing is counting up all the system calls on the computer, grouped by the name of the syscall, probefunc, and the name of the program that called it, execname. And then when you Ctrl-C to end the trace, it truncates the results to the top 10 and prints them out. And it looks like this. Which is pretty cool.

You can also trace things that make sense for a given application, like MySQL, or a language runtime, like the JVM. And you can add probes for your own applications too. The MySQL developers wrote some code that lets DTrace see these events when I run a few simple queries.

And here's where things get really crazy. In addition to being able to trace things that the developer thought of when they were writing the code, you can dynamically insert new instrumentation into the code, which lets you trace arbitrary C functions. You can do that for userland code, but you can also do it for kernel functions, which completely broke my brain. So what we're doing here is tracing how long different kernel functions take to run, and I can download the kernel code and see what those functions actually do if I want, which is a pretty cool learning experiment for me. You're also seeing some nice frequency-distribution reporting here. It's orthogonal to the particular kind of tracing we're doing; it's just another way to aggregate. You can trace new things that the developer didn't think of.
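For reference, here's a minimal sketch of the syscall-counting script described above. This is my reconstruction of the idea, not the exact program from the slide:

    syscall:::entry
    {
            /* Count every system call, keyed by syscall name and calling program. */
            @counts[probefunc, execname] = count();
    }

    dtrace:::END
    {
            /* When you Ctrl-C, keep only the top 10 entries; DTrace prints the
               aggregation automatically as the script exits. */
            trunc(@counts, 10);
    }

You'd run it with something like "sudo dtrace -s syscount.d" (the file name is just an example) and press Ctrl-C when you've seen enough.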
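And a sketch of the kernel-function timing idea, using the fbt provider and a quantize() aggregation for the frequency distribution. The function name here is a placeholder, not the one from the talk:

    fbt::some_kernel_function:entry     /* placeholder kernel function name */
    {
            self->start = timestamp;    /* per-thread nanosecond timestamp */
    }

    fbt::some_kernel_function:return
    /self->start/
    {
            /* Power-of-two histogram of how long each call took, in nanoseconds. */
            @latency_ns["some_kernel_function"] = quantize(timestamp - self->start);
            self->start = 0;
    }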
All right, so the issue we're going to look at happened in real life, but it took me way, way longer than these 10 minutes to solve it back when I didn't know how to use DTrace. Okay, so here's the deal. My team's got some tests that are intermittently really slow. They're pausing for like 30 seconds, when we expect milliseconds. Julia told me in a blog post that computers are fast, so this is terrible: why is it taking 30 seconds? It's not just one test, and it's not just one person who's having this issue. But some people do seem to have it worse than others. Is it the features they're working on, or are they just more vocal than other people? Unknown; we don't know. Anyway, the test suite is taking way, way too long, like many minutes for trivial changes, and it's driving us crazy. It's making us feel really bad about our abilities as programmers.

So, okay, make a quick mental note of your ideas. What should we check? There are lots of hypotheses that could explain this kind of behavior. We could have GC pauses, we could have deadlocks, connection pool timeouts, we could have a sleep that somebody wrote in the code that just fires randomly, right? Lots of bad things can happen, but we don't really have enough information here to solve the problem yet, and you can probably guess from the context clues around this talk that the next step is to gather some data to support or reject our hypotheses using DTrace, and then generate new hypotheses based on what we learn.

Okay, so this app runs on the JVM, and when I'm on the JVM, I always think garbage collection is going to be the bug, and lots of times that turns out to be right. But it's easy enough to test that idea: we can write a DTrace script that traces the GC-begin event and the GC-end event, takes the difference between the timestamps of those two events, and prints it out. It's not the only way to learn about how GC is going, but it works. Our results tell us that 160 milliseconds is the slowest GC, which is a fairly long GC, but it's nowhere near the 30-second culprit we're looking for, right? On that scale of things it's pretty small.

So let's back up: what higher-level resources can we look at? CPU is a pretty important resource. We can sample all the on-CPU processes, say 997 times a second, and aggregate the counts by name to see who's on CPU the most. It turns out there's a process called kernel_task that's running by far the most, and if we drill in to get the stack traces it's running, we can confirm it's pretty much all idle, right, nothing's going on. And incidentally, our Java program only shows up on CPU three times for the whole time we were tracing, so there's not much going on CPU-wise. This is great, because it lets us rule out any hypothesis that involved Java being on CPU a lot. So GC, tight loops, some serious computational algorithmic stuff, too many threads competing for CPU: there's a lot of stuff we just ruled out. I think in general, starting fairly early with these kinds of high-level questions is a good strategy, because it saves a lot of time. It's similar to a binary search in that way, right? It lets us cut out a huge swath of the search space, so ideally we don't have to search all N of the possible performance problems; we do something closer to log N.
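A sketch of the GC-pause measurement described above, using the JVM's hotspot provider. This assumes the gc-begin and gc-end probes fire on the same thread and that the Java process is attached with -p (so $target is its pid); depending on the JVM, some hotspot probes may also need to be enabled explicitly:

    hotspot$target:::gc-begin
    {
            self->gc_start = timestamp;
    }

    hotspot$target:::gc-end
    /self->gc_start/
    {
            /* Report each GC pause in milliseconds. */
            printf("GC pause: %d ms\n", (timestamp - self->gc_start) / 1000000);
            self->gc_start = 0;
    }

Something like "sudo dtrace -s gc.d -p <java-pid>" would run it against the test process.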
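The on-CPU sampling is about this small. A sketch, sampling 997 times a second per CPU and counting by process name; the commented-out line shows one way to drill into kernel stacks:

    profile-997
    {
            /* Each firing samples whatever happens to be on-CPU at that moment. */
            @oncpu[execname] = count();
            /* For the drill-down, something like this captures kernel stacks:
               @stacks[execname, stack()] = count(); */
    }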
So we could similarly rule out memory and disk slowness with other DTrace scripts, and they end up being cleared of all charges as well. So maybe we want to take a look at networking next. We definitely do a lot of calls to the database, right? This is sort of a webby sort of thing, so we might say, okay, we can trace all socket connections and see what's slow there. The code for tracing socket connections is way too complicated, mostly due to some housekeeping and format-translation details, but also due to some other things. It's not even something I'd be able to write easily, honestly, but luckily it's free online. It's in the DTrace book, it's on Brendan Gregg's GitHub, and it gives you the socket connect latency for every process and the address you're connecting to.

So it turns out our database connections are really fast. They're to localhost, which we expect, and then we do see some latencies on the order of 20 to 40 milliseconds. Those are a little bit slower, but it's not like there are thousands of them, and we're looking for something that's taking 30 seconds. On the other hand, it's interesting that we're connecting to this external host, 72.52.41.19, on port 80. I'd expected only local connections when I ran this. So we can grep our code base for that IP address and see that it's not in our configuration or code anywhere, which makes me wonder what that host actually is. And what's the thing we use to map between host names and IP addresses? That's DNS. So we could Google around for how to trace DNS queries, and we can do that by dynamically tracing, inserting our own instrumentation into getaddrinfo calls, which come from a system-level shared library.

And it turns out we find a URL that takes 30 seconds to resolve: someplace.com, it turns out to be. This isn't me just typing a fake address for this example. This is literally the thing we saw in our code that we were trying to connect to: someplace.com, right? And it took 30 seconds to resolve. And this was the underlying issue. This was the actual thing. I don't know what this URL does; I wouldn't go there. But the people getting the slow tests, it turns out, were actually all having slow DNS lookups for this host, and a lot of them were in the same office, and so maybe, who knows?

But at this point, we know what the problem is. We have a clear idea of what the problem is, and now we have a number of options. We can complain to our ISP, or all those people's ISPs, that their DNS is slow. They can switch to Google DNS. We can replace someplace.com with example.com, which turns out to be really fast here in this example. Or we can do what we actually ended up doing, which is a lot more resilient: we just shouldn't make external calls in our tests anyway. We don't need to do this. This is obviously a bogus address. We don't need to do this, right? So just fake it out. Okay, so, the point here is that we now understand the issue, and it feels super, super good to know why this thing was happening.

Okay, so let's take a step back and review. DTrace can help you get your crazy theories validated or rejected. It's clearly not the only tool; there are lots of ways we could have gotten to this solution. But it's awesome. It can ask really, really broad questions, like what's on CPU the most, or really, really focused ones, like how long does this very specific C function take to run.
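The script from the book does a lot more (it decodes the addresses and ports, which is where most of the housekeeping comes from), but a stripped-down sketch of the basic socket-connect timing idea looks something like this: time the connect() system calls and histogram the latency per process.

    syscall::connect*:entry
    {
            self->cstart = timestamp;
    }

    syscall::connect*:return
    /self->cstart/
    {
            /* Histogram of connect() latency in milliseconds, per process.
               Unlike the full script, this doesn't show which address was dialed. */
            @connect_ms[execname] = quantize((timestamp - self->cstart) / 1000000);
            self->cstart = 0;
    }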
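And a sketch of what tracing those DNS lookups might look like, pinning the pid provider on getaddrinfo. The module field is left blank as a wildcard since the exact shared library name varies by OS version, and the test process is assumed to be attached with -p:

    pid$target::getaddrinfo:entry
    {
            /* The first argument to getaddrinfo() is the hostname being resolved. */
            self->host = copyinstr(arg0);
            self->dns_start = timestamp;
    }

    pid$target::getaddrinfo:return
    /self->dns_start/
    {
            printf("resolved %s in %d ms\n", self->host,
                (timestamp - self->dns_start) / 1000000);
            self->host = 0;
            self->dns_start = 0;
    }

Run against the test process with something like "sudo dtrace -s dns.d -p <pid>", this is the kind of output that would show a single hostname taking 30 seconds to resolve.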
And I also want to say it doesn't take years of study to get great use out of this, or out of tools like it. I just started looking at this last October, and even just from learning a few one-liners and Googling around, I still don't consider myself an expert at this at all, but I've gotten a ton of great learning out of it. Tons of valuable insight in solving problems.

So what do you do next if you're interested? I definitely recommend these free online resources. Brendan Gregg's blog is great, and so is the DTrace Guide. There's an 1,100-page book if you really want to get invested; it was terrific as well. And also, as a developer I learn a ton by reading other people's code, so it was really helpful for me that DTrace scripts come pre-installed on my operating system, and some of the ones that ship with OS X are just one-liners, so they're totally approachable. There are a few bugs here and there with them, but it's great, you can learn a ton. And I'd also say this isn't just a tool for solving performance issues. You can also treat it as a general tool for learning how your computer works. It's been great for me in that way. I downloaded the XNU kernel code just to learn about how that works, just as a result of digging around with DTrace a little bit. And if you're on an OS that doesn't support DTrace, or doesn't have great support for it, see what other tools you've got available to help get your questions answered. For example, there are some really cool things being developed for Linux under the iovisor GitHub organization, so check that out. So using DTrace has taught me a ton about performance, operating systems, and problem solving in general, and I really hope it helps you too. Thanks.