So, this talk is about diagnostic reports in Node.js. It's going to cover some of the material that Gireesh covered yesterday, but I'm also going to give you an introduction, say a few things about what you can do with diagnostic reports and how to use them, the basics. And then I'm going to talk about some tooling I built to help you use them.

My name is Chris Hiller, I come from Portland, Oregon, and I'm known as boneskull on the internet. I work for IBM, primarily on Node.js-related things. I'm a maintainer of Mocha, which is a testing framework, and as a maintainer of Mocha I'm also involved in the OpenJS Foundation Cross-Project Council. I'm boneskull on GitHub and on Twitter; if you have nothing better to do, you can look at my tweets, and that's boneskull with a zero.

So I want to start with some hypothetical problems. You have a hypothetical problem: your process crashed. What happens when a process crashes? If you're lucky, you're going to get a stack trace somewhere, unless things went really south. So you might get a stack trace, and you're a developer who's been tasked with investigating it and figuring out what's going on. Remember, this is a dead process, so maybe the stack trace is in your logs. You look at it, and it says you're doing something weird. It points to this code where you're calling rmdir to delete a temp directory or something, and you pass this recursive flag, and the error you get looks like this: Error: ENOTEMPTY, directory not empty, rmdir, yada yada yada.

OK, so why would this fail? Some of you may have an idea. You're passing the correct flag, your meticulous integration tests pass, it works on your machine, it works in CI, the build's green, but this happens. One way to help you figure out this problem is to use a diagnostic report. Can you even see that? Anyway, it says "use a diagnostic report."

So let me describe the diagnostic report; this is the gist. It's an experimental module, some functionality that was added in Node 12, so it's in LTS and you can use it. It is an experimental API, which means it's behind a flag; you need to pass a flag to use it. "Experimental," if you're not familiar, in the Node sense essentially means the API or the behavior could break outside of the normal major release cadence. So if you do start using reports, please be aware that they could break. That being said, they do their job very well; it's just that the API might change, and the output might change slightly, before we hit the next major.

Essentially, what this is, is a huge JSON dump reflecting the state of the process. Most of the ones I've seen work out to be about 20 or 25 KB. You can trigger it several different ways: you can give Node some command-line flags, you can trigger it programmatically, and you can even tell it to dump a diagnostic report when the process receives a user signal.

So how do we create a report in this case, where we've got this process that's crashed? We're going to start that process up again, except we're going to give it these flags: --experimental-report, which you need to do any of this stuff right now, then --report-uncaught-exception, and then --report-filename to give it a nice filename. You don't need to pass the filename, but in our case it's helpful; normally it creates a very long filename based on the timestamp.
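To make that concrete, here's a rough sketch of what those invocations can look like on Node 12, where reports are still behind the experimental flag; app.js and the filenames are just stand-ins:

```sh
# Node 12.x: diagnostic reports are experimental, so the feature flag is required
node --experimental-report \
  --report-uncaught-exception \
  --report-filename=report.json \
  app.js

# reports can also be triggered programmatically once the feature flag is on
node --experimental-report -e "process.report.writeReport('manual-report.json')"
```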
So you run this in production, time passes, and now you have another problem. Now you have a diagnostic report, because it crashed, and now you have a lot of JSON. It looks kind of like this, just this blob, and we can zoom in and take a closer look. It contains a whole lot of stuff, and I'm going to try to run through it pretty quickly. There are nine top-level properties, or eight, depending.

The first one is the header, and that's all about the report itself: information about the Node process, the command line, the version, the versions of the libraries that Node uses, the operating system and its version, the CPUs, all sorts of stuff. That's in the header.

Scroll down, and these are in order, so the next one you'll see is the JavaScript stack, which of course gives you the stack; in this case it crashed on an error. Next you get the native stack, which is pretty far under the hood; you may or may not need that, but it's there anyway. Next is information about the heap, so that's your memory usage. Then resource usage, which is your CPU usage and a little bit about file system activity. Next is this "libuv" section, which might need a better name, but it's essentially the state of the event loop: what's in the event loop right now? It gets a little technical, but there's stuff in this particular event loop. Then environment variables; this has been trimmed here, but it's everything in your environment. Then user limits, which Windows users won't get: if you're a user on a Linux system, you'll have limits on what you can consume. And shared objects are the shared libraries that Node is using.

So what are we concerned with? What can help us solve the problem we have? Well, it's here in the header. We look in the header, and we want to focus on the Node.js version. The problem here is that rmdir with that recursive flag didn't land until 12.10, so your Node version is too old, and a stack trace wouldn't tell you that. Great, you found the problem. Good job.

So you take this and you want to say, look, everybody, this is the problem. You go into Slack, you take this big report, and you paste it in there. And now you have another problem: what you just did was leak the entire environment into Slack, or wherever you sent it. Maybe you sent it through email. Hopefully you didn't put it on Pastebin. There could be your AWS credentials in there; who knows? So your team lead is pissed, and that's what we need to avoid. If you want to send one of these report files around, you need to make sure they're scrubbed of things that shouldn't get out. So you go back, and you delete your Slack message, and you open the report, and you delete the secrets, and then figure out how to exit vim. This is all very tedious.

So there's a tool I was working on, and it's out now, called report-toolkit. It's a tool for processing and analyzing diagnostic reports. It's kind of a multi-tool, so it does several different things; it's not very Unix-y. You know how multi-tools kind of suck at doing any one thing? Anyway, I'm getting ahead of myself. This thing does some cool stuff.
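As a quick, hedged example of pulling that one field out of the report without scrolling through the whole blob (the field name here follows the Node 12 report layout):

```sh
# print just the Node version recorded in the report's header section
# (header.nodejsVersion, per the Node 12 report format)
node -p "require('./report.json').header.nodejsVersion"
```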
It gives you a CLI tool to consume these things, and there's a programmable API. You can check out the docs, which are incredible, and the repo is up there.

So what can we do? We can use report-toolkit and give it the redact command, and pass it the report.json file, or foo.json, whatever I called it. What this command does is look for things that it knows are potentially naughty and need to be kept secret. It's based on the blacklist that AWS's git-secrets project uses; you may be familiar with that. But you can customize it to your needs. It will replace all those terrible secrets in the report file with a placeholder string, and it overwrites the file in place, so nobody's the wiser. Now you can safely pass this report around, share it with your colleagues, discuss it over dinner.

But time passes, and you have another problem. You have this process, maybe it's even a test or something, and it's running, but you thought it should have stopped. It's not technically a zombie process, but I'm just going to call it a zombie process. You don't know why, and this is weird. You open up your debugger, and it doesn't stop; it's not doing anything, it's just sitting there, it's not hitting lines of code, you set breakpoints, whatever. You don't know why.

One thing you can do, and this is something diagnostic reports can help you with, is generate a diagnostic report on demand. The process doesn't have to crash for you to get one. I know we love command-line flags, so we can pass --report-on-signal. By default, this makes the process respond to the SIGUSR2 signal, and that's configurable. So you start your process, and then you can do this sort of thing with the process ID: kill -USR2 and the pid. That sends SIGUSR2, and when the process receives that signal, it says, ah, it's time for me to create a diagnostic report, and it dumps one out. (There's a sketch of this flow a little further down.)

So you look at this diagnostic report, and now I'm going to cheat, because I know where to look. I would look in this libuv property, and I would go down and see: oh look, there's this timer, and it's active, so it's in the event loop, and it's referenced, so it still hasn't been garbage collected. And it fires in some number of milliseconds from now, 999-something, which is a while. Using this, you get a clue: I must have created some setTimeout, or some interval, or something, and I was off by several orders of magnitude. Who knows? But that gives you a clue to figure out where the problem could be.

If you don't know where to look, report-toolkit can do this sort of thing for you. It has this inspect sub-command, and this is the thing I think is really neat. There are these rules; they're heuristics, just some algorithms, functions that accept a report file. A rule examines the report file and decides what to do. There are built-in rules, and one of them happens to be the long-timeout rule, which looks for this very situation in your report file. So you can run this on your report file and see: is there anything fishy going on here?
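Putting that zombie-process workflow together, a minimal sketch might look like this; server.js is a stand-in, and I'm assuming the report-toolkit binary is reachable via npx (the project also exposes a shorter binary name), so check the docs for the exact invocation:

```sh
# start the stuck process with signal-triggered reports enabled (Node 12, experimental)
node --experimental-report --report-on-signal server.js &

# SIGUSR2 is the default report signal; this dumps a report without killing the process
kill -USR2 $!

# scrub secrets in place before sharing the file anywhere
# (the generated filename is timestamp-based; report.*.json matches the default pattern)
npx report-toolkit redact report.*.json

# then ask the built-in rules whether anything looks fishy
npx report-toolkit inspect report.*.json
```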
One of those rules is the long-timeout one, which will let you know if there's a timeout that's far off in the future and still active. And you can write your own rules for this; it's like a plugin system, so you can write your own. It works similarly to how ESLint rules work: you can write your own rules, publish them to npm, have one that talks to the blockchain. I'm not sure why you'd do that, but you could.

This is what the output looks like. Pretty simple, just this kind of tabular thing that says: there's an error-level issue in this report file, the rule that was triggered is this one, and here's the timer with the bad expiration time. That's one of the rules. There are others that check that your memory usage is within an expected range, and that your CPU usage is within an expected range. There's another one that examines your shared libraries versus the libraries Node was built with, and flags a mismatch. That's not something most people need to be concerned about, but if you're compiling Node yourself it might come up, say if you have a different version of OpenSSL than Node expects.

Another problem you might have: you've got a flaky process. It's running, and you're not sure why it just fails once in a while. Maybe it fails on one machine but not another, and you can't really tell what the difference is. One thing report-toolkit provides that can help here is a diff sub-command. You could take report-a.json and report-b.json and give them to your favorite diffing tool, but that's for diffing source code or text files; it's not for diffing these report files. The neat thing is that when we know the shape of the data, we can create a purpose-built diff tool for it, and that's what this is. It tries to ignore stuff it thinks you probably won't care about, so it has a better signal-to-noise ratio. It tries to make it nicer for you to look at two reports and say, oh, that's how they're different, instead of a huge unified dump or a side-by-side diff. It answers questions about your process: if you run it again and again, you can diff the reports and ask how the process changes over time. Maybe that's a single process, maybe that's a process on several different machines, but you can diff any two reports this way. (There's a sketch of the command below.)

The diff output looks something like this. In this case, we see that the command-line flags are a little different. With the first report file, we actually passed -e for eval, so the command that was run was basically, hey, just write a report. The other one, who knows, but it didn't have any command-line options. The first report was generated with 12.1, the second one with 11.2. This is just an excerpt of the diff, but that's the idea. And if you don't like the tabular output, you can choose different formats; maybe you want JSON or CSV or something.

Another thing: maybe you've got processes that are crashing somewhere. Maybe you have a lot of them, and maybe you think that's not a big deal, we can just restart them, right? But you want to know how frequently certain exceptions are happening, and maybe that will help you prioritize bug fixes, or who knows what.
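Here's a hedged sketch of that diff invocation, with the same caveat about the exact binary name; the filenames are stand-ins for any two reports:

```sh
# purpose-built diff of two diagnostic reports; ignores fields you likely don't care about
npx report-toolkit diff report-a.json report-b.json
```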
So to be able to figure that out, how often a particular exception happens, you need to be able to count exceptions. How do you count an exception? You can't really stuff the whole exception into a counter somewhere, but what you can do is take a hash of it. There's some customization that can happen here, but you can take a hash and output a little bit of JSON with a SHA-1 in it, using report-toolkit. You could do that with a script, of course, but report-toolkit does it out of the box with its stack-hash transform.

It will also convert these diagnostic reports to CSV or JSON, and even filter stuff. If you only want a couple of those fields, there's a filter transform. Table, of course, is the output you saw before. Newline is newline-delimited JSON, if you need that sort of thing. Numeric is kind of an experiment: you can use it in a shell context where you pipe it to something; there are these neat little tools that generate graphs in your console, so you could combine numeric with filter, pick out a certain field, and keep running it over time. Redact, of course, is essentially the same thing as the redact command. You can combine these transforms, write your own, publish them to npm, and use them.

So this is what the stack-hash output looks like. You get this stack hash, and you can see there's a SHA-1 hash calculated for it. I think you need to be able to customize this a bit: maybe your exceptions have some user information in them that you want to get rid of, maybe there's some personally identifiable information in there, so you should be able to pass it a regular expression, or just write a function and plug it in. It helps you generate this stack hash, and then you can feed that to your logging tool or your metrics system or what have you. (There's a rough sketch of the idea at the very end.)

So I think that's about it. What we learned: what a diagnostic report is, and how to create them. That's not every way you can create them, but it's a couple of them; you can also create them programmatically, which might be useful if you're trying to grab them in, say, a serverless environment. We saw how you can use them to solve certain problems; they're especially useful, of course, in post-mortem debugging, where you don't have the option of running a debugger because your process is already dead. And of course we saw how report-toolkit can help you work with diagnostic reports when they become tedious, and how it can help you uncover problems you may not be aware of.

If you want more information about diagnostic reports, it's in the lovely Node.js documentation. There's a tutorial written by Gireesh, who spoke about diagnostic reports yesterday and who also got this code into core; that tutorial links to developer.ibm.com. And, I apologize, this is not very legible, but the documentation site for report-toolkit is ibm.github.io/report-toolkit. I'll leave that up for a second. It is an IBM project; I'm the only person working on it, but it is still an IBM project. So again, I'm Christopher Hiller; you can call me Chris. I work for IBM. I like Node and Mocha and stuff. Look at my website and things. Thank you, Montreal, and Node.js Interactive.
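For reference, here's a conceptual sketch of the stack-hash idea mentioned above. This is not report-toolkit's actual implementation or CLI syntax (see its docs for that); it just shows, assuming the Node 12 report field names, how a crash stack can be reduced to a hash you can count:

```sh
# conceptual sketch only: hash the JavaScript stack from a report so that identical
# crash sites collapse to the same SHA-1, which a logging or metrics system can count.
# Field names (javascriptStack.stack) follow the Node 12 report format.
node -p "
  const crypto = require('crypto');
  const report = require('./report.json');
  const frames = (report.javascriptStack.stack || []).join('\n');
  crypto.createHash('sha1').update(frames).digest('hex');
"
```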