Good afternoon, and thanks for joining this talk. My name is Girish, and I work for IBM. I'm a member of the Node.js Technical Steering Committee as well as the Diagnostics Working Group. I've worked on a great many customer issues in Node.js production deployments, and this talk draws on that experience in the field. As we know, diagnostic best practice is a vast topic, and each scenario, each best practice, could take hours to discuss in full. So in the next 30 minutes or so we are going to cover the top problem symptoms as seen in the field; the tools that can be applied to those symptoms for better problem determination; the best practices around them; their capabilities; and their support in terms of test coverage, long-term support, and so on. Then, from the Diagnostics Working Group's perspective: how we looked at these problem symptoms, what the state of these tools was a few years back, how we approached the tooling gaps in capability, coverage, and support, and how we arrived at more comprehensive tooling. That leads into a discussion of user-journey or use-case-driven tooling development, an innovative initiative from the working group. Finally, we'll talk about the Diagnostics Working Group itself, which is part of the Node.js core project: what people are working on, the challenges, the opportunities, and how to get engaged. So that's pretty much the agenda.

Now, the most common problem symptoms we have observed in Node.js production deployments. Number one: memory leak. A memory leak is basically a mismatch in expectations between the program and the garbage collector. The garbage collector's contract is: if an object has a strong reference in the application, I am not going to collect it; if the object is out of scope, or there is no strong reference, I am going to collect it at the very next opportunity I get. If the application holds a strong reference to an object by mistake, or through incorrect program logic, that leads to a memory leak (a small code sketch follows this overview). Out of memory is nothing but the culmination of a memory leak: when many objects that the program believes should be collected never come into the garbage collector's purview, that memory accumulates over time and leads to an out-of-memory failure.

Abnormal termination is, again, a mismatch in expectations between the programmer and the runtime. If the control flow of the application, or of the virtual machine itself, reaches a state where it encounters an unexpected scenario from which the virtual machine cannot recover on its own, that leads to abnormal termination. The other side of abnormal termination is the exception, a lighter version of the same thing: the application encounters an unexpected scenario, but the virtual machine catches it, composes it into an exception, and throws it back to the caller. In either case, we can call it an abnormal condition in which the program cannot logically continue.
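To make that contract concrete, here is a minimal sketch of how such a leak typically looks; all the names in it are invented for illustration.

```js
// Minimal sketch of an accidental strong reference (all names invented).
// The cache lives at module scope, so everything in it stays strongly
// reachable, and the garbage collector honors its contract: reachable
// objects are never collected.
const responseCache = new Map();

function buildLargePayload(url) {
  // Stand-in for real work: a few megabytes per request.
  return { url, data: Buffer.alloc(4 * 1024 * 1024) };
}

function handleRequest(req, res) {
  const payload = buildLargePayload(req.url);
  // The bug: an entry is added per request but never evicted, so the heap
  // grows until the process dies with an out-of-memory error.
  responseCache.set(`${req.url}:${Date.now()}`, payload);
  res.end(payload.url);
}
```

Because the Map is reachable from module scope, nothing placed in it is ever eligible for collection; evicting entries explicitly, or using a WeakMap where the keys are objects, restores the expectation the programmer had.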
Hang is the low-CPU case. The program thinks it should be moving ahead, but because of a peculiar control flow or a bug in the program it is just stuck at some point: it believes it has nothing to do, and it waits for a condition that is never going to occur. The exact opposite of a hang is high CPU, or burnout, where through a bug the program continuously loops between two code points while the programmer expects it to carry on processing transactions.

Performance issues are, I guess, by far the most common symptoms in large-scale production. You have two types of performance issues. The first is measured against a baseline: the current version of the program is performing badly, where the baseline could be an older version of your application or an older version of the Node.js runtime. In the second there is no baseline; everything is fine, but you want to examine the application and improve it based on certain criteria or parameters of your own. For example, look at each aspect of your request-response cycle and see where the most CPU is spent within the whole cycle, be it in the JavaScript space, in the Node.js runtime, in any of the APIs, in the operating system, or in the C runtime, and then work out how to improve the situation. That's the second case of performance debugging.

And the last one is the incorrect result. This is mostly caused by a bug in the compiler, specifically the just-in-time compiler. The program expresses a computation and expects a certain output, but because of the wrong way in which the JIT compiler compiled it, you end up with an incorrect result. This is a very hard problem to crack because you have no visible symptom: no exception, no crash, nothing like that. Over time, maybe when you do an audit after a couple of weeks, you see that the data stored by the application is not correct; then you debug backwards and figure out that certain functions are not behaving properly. So that's the wide range of common problems encountered in production deployments.

Now let's look at one of the key problems, which is performance. I have already described performance debugging; the most common activities we perform are to collect the profile data, analyze it, and render it in a meaningful manner that is consumable and easily readable. Now, how do you collect the performance data? By far, most performance-analysis tools use a technique called CPU profiling. CPU profiling is nothing but sampling the application at regular intervals, ideally from another thread, because if you profile on the main application thread, the profiling itself incurs a penalty. So usually you spawn a new thread, and this new thread is responsible for collecting the CPU data without imposing a penalty on the application. There are very sophisticated CPU profiling tools, ranging from V8 itself, the runtime engine, down to the operating system, where kernel threads examine your application at regular intervals so that no penalty lands on the app.

Analyzing the profile data: the task involved here is to translate the collected CPU samples into a meaningful mapping onto whatever is responsible for consuming the CPU, in other words onto symbols: the functions belonging to the JS app, the Node.js API, the C++ wrappers, the C runtime, or the operating system; anything that participates in the transaction.
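As a concrete illustration of the collection step, here is a minimal sketch using Node's built-in inspector module to capture a CPU profile in-process; the five-second window and the output file name are arbitrary choices for the example.

```js
// Minimal sketch: in-process CPU profiling via the built-in inspector module.
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

session.post('Profiler.enable', () => {
  session.post('Profiler.start', () => {
    // Profile only a window of interest rather than the whole run.
    setTimeout(() => {
      session.post('Profiler.stop', (err, data) => {
        if (!err) {
          // The .cpuprofile JSON can be loaded into Chrome DevTools
          // or converted into a flame graph by downstream tools.
          fs.writeFileSync('app.cpuprofile', JSON.stringify(data.profile));
        }
        session.disconnect();
      });
    }, 5000);
  });
});
```

This start/stop shape is also what makes the fine-grained, peak-load-only profiling discussed a little later practical.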
Rendering the bottlenecks, visualizing them: by far the most common tool is the flame graph. A flame graph is a two-dimensional graph with two parameters. The first is the call stack: the height of the graph represents a sequence of calls, so if A calls B and B calls C, then A, B, and C appear in one single vertical pillar. The second parameter is the color coding, a very simple scheme that ranges from light amber to dark red: light amber means very little CPU was spent in that method, and dark red means the opposite, that this is where most of the CPU was spent. You can expand each of these pillars or columns and zoom in to see the signature of the method and get the finest level of detail.

Now, looking back from the user-experience angle, how did we get there? The state of the art was not that sophisticated; we had a lot of problems a few years back when we looked at performance. The first one: a lot of tools produced differently formatted data, and those profile formats were not compliant or compatible with each other. We had to come up with a standardization model where these tools meet on common ground, so that the flame graphs and the rendering tools can work in a seamless manner. The second one: lack of support for native frames, by which I mean that from a JavaScript application programmer's perspective everything is a JS app, but if you are profiling the CPU there are many other players in the stack. As I said, there are the C++ frames, the glibc that supports the node executable, and the operating-system routines. When you are profiling your application, the time could be spent anywhere in that stack. For example, if you are doing highly CPU-intensive reads from a buffer, the actual CPU time will show up in the kernel as opposed to the JavaScript stack. The profiling tool should be intelligent enough to show where the time is actually spent, and that was not the case earlier. So we had to implement a stack walker, which tells you precisely in which part of the application stack the CPU is being spent.

And when the just-in-time compiler switched from Crankshaft to TurboFan a couple of years back, one of the funniest things happened. During the initial bootstrap or warm-up phase of an application, most methods run interpreted, and it takes some time to figure out which methods are hot and therefore candidates for just-in-time compilation. All that time the methods are interpreted, and if you profile such an application, what you get is the interpreter's symbol as opposed to the actual JavaScript method, which is a very poor user experience. If you profile for five minutes, and for most of those five minutes the app was running thousands of JS methods, the whole five minutes will be shown against one single C++ method, V8's interpreter, which is of no use to the end user. So we had to implement the code event listener API, which translates the actual JavaScript method being executed on the interpreter stack and passes it back as the actual profile data, so that the flame graph can work properly.
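As a brief aside on the rendering step: many flame-graph pipelines first collapse the raw samples into a "folded stack" text format, one semicolon-joined call chain plus a sample count per line, before drawing anything. Here is a toy sketch with invented sample data showing that aggregation; real tools do the same thing at much larger scale.

```js
// Toy sketch: collapsing sampled call stacks into the folded format
// consumed by common flame-graph renderers (sample data is invented).
const samples = [
  ['main', 'handleRequest', 'parseBody'],
  ['main', 'handleRequest', 'parseBody'],
  ['main', 'handleRequest', 'queryDb'],
];

function foldStacks(stacks) {
  const counts = new Map();
  for (const stack of stacks) {
    const key = stack.join(';'); // A;B;C becomes one vertical pillar
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].map(([key, n]) => `${key} ${n}`).join('\n');
}

console.log(foldStacks(samples));
// main;handleRequest;parseBody 2
// main;handleRequest;queryDb 1
```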
Back to the list of problems: another one was performance overhead in production. Because this translation happens on the fly, you had to accept the overhead of running the profiler inside the production workload itself. That means if your application is highly time-sensitive, if you have high traffic that is sensitive to latency, this is not going to be a sustainable model. So we had to implement fine-grained control of profiling: don't profile the application from end to end; rather, figure out certain workload patterns, high-volume or peak-load situations, switch on profiling only at that point in time, and switch it off once that workload pattern has ceased. Those kinds of optimizations. And then, finally, the profilers used to make use of V8 APIs, and what happens when V8 introduces breaking, non-backward-compatible changes? The expectation is that when V8 changes its APIs, these tools also migrate and adapt to the new APIs. But the tool is developed by somebody, and V8 is developed by somebody else, so this break in expectations did happen. That's where the test coverage and long-term support of the tools come into the picture. So the summary of the profiling use case: look at it from the user-journey perspective, at the actual experience the user is getting and the problems faced while trying to use the tool effectively, and then walk backwards and address each of them in the core itself, so that you get a best-in-class problem-determination experience.

The next one is crash. Here, crash means abnormal termination, which essentially means either a hard crash in the JavaScript execution or a hard crash in the Node.js land: a C++ crash versus a JS crash. The scenarios are these. First, debug a crash from a core dump, which is post-mortem debugging. Second, attach to a Node.js application live: you know the problem reproduces every time you launch it, so it is much easier to live-debug. Third, you have exhausted the other tools, memory debugging, profiling, and the rest, and none of them captures the information you are looking for; the debugging scenario lies deep inside the runtime, and you want to inspect the objects live, the JavaScript heap, the stack, the instruction pointers, the registers and their values, and things like that. That's when live debugging comes into the picture. Live debugging and core-dump debugging are the most advanced ways of debugging an application; they are a kind of last resort in problem determination, but the most powerful as well.

Now, as you can see on the right-hand side, there is a composite stack which is supposed to show both the JavaScript stack and the C++ stack in conjunction. Again, looking from the user-experience perspective, the very first problem we had is with native debuggers such as GDB, WinDbg, or DBX. These are very powerful tools, good for debugging C++ applications, and as we know Node.js is a C++ app; but the problem is that they don't recognize JS frames. They don't understand anything that happens in the JS world. If you dump the stack, the registers, or any context of the application, you will only see data pertinent to the C++ side. Basically, they look at things from the perspective of the Node.js executable, not the JS app. So that's, again, a poor user experience.
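Before looking at the JavaScript-aware debuggers, a quick sketch of how one obtains a core dump from a Node.js process in the first place. The setup in the comments assumes a Unix-like system where core files are enabled, and the environment-variable toggle is invented for the example.

```js
// Sketch: producing a core dump for post-mortem analysis.
// Assumed setup (outside this file):
//   ulimit -c unlimited
//   node --abort-on-uncaught-exception app.js

process.on('unhandledRejection', (reason) => {
  // Re-throwing turns the rejection into an uncaught exception, which
  // --abort-on-uncaught-exception converts into an abort plus a core file.
  throw reason;
});

// Or take a core deliberately at a point of interest, without a crash:
if (process.env.DUMP_CORE === '1') { // hypothetical toggle
  process.abort(); // raises SIGABRT; with core dumps enabled, a core is written
}
```

The resulting core can then be opened in a native debugger or, as discussed next, in a JavaScript-aware one.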
Coming back to what you actually want: you want to see the JavaScript objects and JavaScript entities in black and white. You want a clear differentiation between the APIs, the user app, and the runtime elements. In certain scenarios you don't want to get into the Node.js or V8 space at all; you want to confine your problem-determination (PD) activity to the app you developed. In that case, all you are looking at is: what are the objects, how are the objects related, what does the dominator tree look like, which objects are lying in the heap, which are eligible for garbage collection, and things like that. The first answer to that was MDB, a debugger that understands both JavaScript and C++. But MDB had certain limitations. It was not able to correlate the JavaScript stack or JavaScript entities with the native entities. And the second limitation was that MDB was not a cross-platform tool, while Node.js currently supports around 15-plus platform combinations, operating-system and hardware combinations, and we wanted best-in-class support for both JavaScript and C++ contextual problem determination across the board, irrespective of the platform.

These two problems were addressed by llnode. llnode, in simple terms, is a plugin on top of LLDB, the low-level debugger from the LLVM project. The basic debugging functionality, live debugging, attaching to a process, looking at the instructions, the context, the call stack, and so on, is supplied by the native debugger, and the additional context, the JavaScript insights, the heap and the stack, the object hierarchy and relationships coming from the application, is supplied by the llnode plugin. So LLDB working in conjunction with the plugin gives you the best user experience. And the last point, again, is breaking changes across V8 version boundaries. The same theory applies: as V8 changes from version to version, we had to maintain the same seamless experience by porting llnode to match the new API.

The last tool is the diagnostic report. The diagnostic report is basically an instance of FFDC, first-failure data capture. While it does not help diagnose any single problem by itself, as the first slide mentioned, it is a general-purpose problem-determination tool which can be applied to any problem in production as the first step in debugging. For any sort of problem, there are certain questions you ask your customer: What is the Node.js version? What is the operating system? What is the heap size? What are the characteristics of your application, its configuration, and things like that? If you take a snapshot of the report, which captures a lot of data from the running application, it contains all this information in JSON format. That gives you a very good starting point for problem determination. In a good number of cases, the report by itself gives you the ability to figure out what the problem is and provide a final solution; in the most common cases, it is a good first step toward deciding the next debugging recommendation. That's why we call it FFDC, first-failure data capture. And this is how you invoke it: pass --experimental-report, and then either provide a tunable to define when to produce a report, or just leave it as it is and allow the app to run; at some point in time, if you want to capture a snapshot, you send a signal to the running app and you get a report captured.
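For illustration, here is a minimal sketch of triggering a report from code. The feature was experimental at the time, so flag and API names shifted between versions: the on-signal path just described is enabled with flags such as --report-on-signal (SIGUSR2 by default), and the in-process trigger is exposed as process.report.writeReport(), which some earlier builds named triggerReport().

```js
// Minimal sketch: writing a diagnostic report on demand.
// Run with: node --experimental-report app.js (flag needed while experimental).

if (process.report && typeof process.report.writeReport === 'function') {
  // Writes a JSON snapshot (Node version, OS details, heap statistics,
  // stack, environment, loaded libraries, ...) and returns the file name.
  const filename = process.report.writeReport();
  console.log(`diagnostic report written to ${filename}`);
} else {
  console.log('diagnostic report not available on this Node.js build');
}
```

Because the output is JSON, it can be parsed and compared by machines, which matters in the history that follows.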
Again, looking from the user-experience perspective, the first problem we addressed is that we did not have an FFDC in the first place. Then came node-report, a third-party npm module. The problem with a third-party npm module which is a native module is that you can't use it in production, because in the most common scenarios production has no native build capability: you don't have a build toolchain in production, for security reasons. So you either build it locally or on a staging environment and then move it to production; you end up with many workarounds, and it's not a sustainable model. Secondly, there was again the version compatibility with V8. Then, the report produced by the third-party module was not a machine-readable format; it was a human-readable one. And it did not have enough coverage in the CI, which essentially means its long-term support and test coverage were not of great quality. So we looked at all of this from the user-experience perspective, asked what could be improved, and came up with these enhancements: the report produced in core is in JSON format, and we have very good test coverage; as part of every release, we run these test cases. That means it is production-usable at this point, though it's experimental. We'll have a discussion at the Collaborator Summit, which is happening in a couple of days, about transforming the report from experimental to stable.

So those are the three main tools of diagnostic best practice which I wanted to talk about. As I said at the beginning, diagnostic best practice is a huge topic; it cannot be covered in half an hour. I have a dedicated session that spans 60 minutes at the Collaborator Summit on the 14th, at 1.30 p.m., where I'll take each of the tools in use and developed by the Diagnostics Working Group and look at the capability, the functionality, the test coverage, the long-term support, the quality, and the selection criteria against each of the user journeys.

Now let's look at the Diagnostics Working Group itself. Here is its goal: Node.js should provide a comprehensive, documented, extensible set of diagnostic protocols, formats, and APIs that enables vendors to provide reliable diagnostic tools for Node.js. So it's not only about providing the tools for debugging problems, but also about comprehensive protocols and APIs that cover the specific user journey in question, so that even if you are building a new tool, you can adhere to the protocol and the bad user experiences never come into the picture. That's the whole idea. And this is the support-tier model, which essentially looks at the available tools, the existing state of those tools in Node.js core, whether they have good CI coverage, whether they are used in production, and the target tier in which we would like them to settle. And this is the current set of activities in the Diagnostics Working Group; we don't have enough people working on most of these items. As I said, we covered just three tools here, but based on your particular deployment and your specific application scenario, you could have a different set of tools and different use cases in mind. So the best way forward is to look at the user journey, see what the state of the art is with the corresponding tool, ask whether the tool is usable for that particular use case, and if not, look at getting engaged with any of these initiatives.
We have a bi-weekly meeting in the working group, on Wednesdays at 11 p.m. my time in IST, though the local time will differ for each of you, where we spend an hour discussing the status of the various tools, the focus areas, the challenges, the opportunities, and things like that. So please get engaged, and let's make Node.js diagnostics a great success story. Just to recap: evaluate whichever of the tools is most appropriate for your application, provide candid feedback, and get engaged with the working group in whichever way you can, whether based on your skills, your aspirations, or the specific use case you have. And that's it. Thank you very much.