Hello, people on the internet. Thanks for tuning into this talk about consuming observability features in Node.js. Hopefully what I'm presenting here will be useful in your day to day. But before we get into the talk, let's do a brief intro of who I am. So like Ryan said, my name is Luke Holmquist, or Lucas, it's either or. I work at Red Hat as a senior software engineer, and my main focus is on the NodeShift project. This project aims to help developers who are creating Node applications deploy them to OpenShift and Kubernetes in an easy way. You can find me on Twitter at Siena Luke; it's a combination of the school I went to and my name. I saw there was some tweet going around about how you got your Twitter name, so that was the answer to that question.

Also, just a couple of random facts about myself. I'm a huge Phish fan. So if you're into the band Phish (not fishing, all that stuff) and you want to talk music, you can find me on Twitter for that too. I'm also the current title holder of the Star Wars trivia contest at my local library, and I know entirely too many facts about the original trilogy. I usually always have my Yoda figure here with me when doing talks; he usually travels, but now that we're remote, he's still next to me. So if you're also a Star Wars nerd like myself, find me on Twitter and we can talk Star Wars and that kind of stuff. And as a special bonus, if you do stick around to the end, I will divulge the winning question from that Star Wars trivia contest and answer it for you. So there's a little extra benefit for sticking around for the whole thing. And since I am remote at the moment, I'll share where I'm located: I'm in upstate New York. If you're not from New York, then don't worry about this picture, but if you are, then you'll know what I'm talking about with this photo.

So let's get into the talk itself. Congratulations! You've deployed your application to production, most likely in a container on some sort of container platform like Kubernetes or OpenShift. At the moment your application is working well, your users are happy, and everything seems to be going pretty well. So now what? How do you keep your application running, and running well? How do you keep those users happy? That's always a key thing. A key part of the "now what" question is being able to observe what's going on within our application. For example, how many resources is the application using? Is the application calling a function that is blocking the Node event loop, causing users to wait longer than they expect? Or is there a REST endpoint that is having intermittent failures, making your users unhappy? Being able to determine when things are outside of the norm will help keep your application running and your users happy.

So with that said, let's take a quick look at what we're going over today in this talk. First, we're going to look at some of the key observability metrics that Node provides natively, as well as some new additions that have recently landed. Next, we'll look at how we can get access to this data when your application is running in production in a Kubernetes-type environment. And after that, we'll take a look at some of the things that you should be collecting.
And finally, we'll end with a quick demo of some of the things from this list. But before we get into it, we should probably define what observability is, exactly. According to Wikipedia, it's "a measure of how well internal states of a system can be inferred from knowledge of its external outputs." Well, what does that mean, exactly? In the context of this talk, the internal states would be the functions and other business logic being executed and processed in our application, and the external outputs are the things that we want to look at, or observe, to see if everything is running as expected. So we can probably rewrite this as "a measure of how well our application is running, using some metrics that our runtime provides." And with that definition under our belts, let's see some of the different runtime metrics that Node has to offer natively. And since we know that Node is built on top of V8 (not the drink; so there we go, that's the right logo), we can access some of that data as well.

OK, let's see what's first. First up is everyone's favorite tool to debug with, and I know it's mine: console.log. It's maybe not the greatest for production, but whatever. I was tempted not to put this in, but I'm sure that if I didn't, somebody on the internet would be like, "well, actually, you should have used console.log." So this is for all those folks who are going to DM me after this. I probably don't need to explain how this works. I'm sure we've all had similar statements like this, whether it's "here", or "why", or "ha", or "it worked".

Next up, we have some different ways of doing profiling. For those that are new to what profiling is, when talking about code, a profiler is analysis software that measures the frequency and duration of function calls, as well as making sure that those functions are producing the desired result. While there are third-party tools that can be used, Node actually has this capability built in since, as I just mentioned, it's built on top of the V8 engine. The V8 profiler allows you to sample the stack at regular intervals during the execution of your application, which gets logged as a series of ticks. This can be done on the command line while running your application using the --prof flag. This produces a log file that needs to be processed in order to become useful to look at, which can be done with the --prof-process flag. This right here is just a sample of a section of what the processed log output could look like, and for this particular example, we can see that there's a Node crypto function that seems to be taking up a lot of time. With that knowledge, you can go back into your application code, pinpoint what's happening, and make the necessary adjustments. In this particular example, which is taken from the guide on profiling on nodejs.org, the code is using the synchronous version of that particular crypto function, so switching to the async version actually helps clear that bottleneck up.

If running with CLI flags isn't really your thing and you like to use APIs instead, there's the inspector module inside Node. It's similar to the CLI flags, but you can start and stop profiling from code. This is an example of how you might do a CPU profile: we require the inspector module, create a new session and connect to it, then enable and start the profiler, do some tasks, then stop the profiler.
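The code on that slide isn't captured in this transcript, but a minimal sketch of it, based on the example in the Node.js inspector docs, might look like this (the busy loop is just a stand-in for your real work):

```js
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

session.post('Profiler.enable', () => {
  session.post('Profiler.start', () => {
    // The work you want to profile goes here; a stand-in busy loop:
    for (let i = 0; i < 1e7; i++) Math.sqrt(i);

    // Stop the profiler and write the result to disk.
    session.post('Profiler.stop', (err, res) => {
      if (!err) {
        fs.writeFileSync('./profile.cpuprofile', JSON.stringify(res.profile));
      }
      session.disconnect();
    });
  });
});
```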
And the file that is output can actually be loaded into the Profile section of Chrome DevTools to get a more visual view of what's going on.

Next up is trace events. The trace_events module provides a mechanism to centralize tracing information generated by V8, Node core, and user code. Tracing can be enabled with the --trace-event-categories command-line flag or by using the trace_events module. The --trace-event-categories flag accepts a list of comma-separated category names, and the trace_events module allows users to create custom traces. By default, the node, node.async_hooks, and v8 categories are all enabled. And like the inspector module, running Node.js with tracing enabled will produce log files that can be opened in the tracing tab of Chrome DevTools. In this example, we're creating a trace for the node.promises.rejections category, which enables capture of trace data tracking the number of unhandled promise rejections and rejections handled after the fact. There are actually a lot more categories, so after this talk I would invite you to take a look at the trace_events API docs on nodejs.org to learn more.

Next, we have the perf_hooks module. These APIs allow developers to set various markers that make measuring the runtime of an application easier. This module provides an implementation of a subset of the Web Performance APIs, as well as additional APIs for Node-specific performance measurements. Node supports the following Web Performance APIs: High Resolution Time, Performance Timeline, and User Timing. With these APIs, you can measure the time it takes individual dependencies to load and how long your app takes to initially start, as well as determine how well the event loop is being utilized and whether your asynchronous code is operating efficiently. Here's a quick example of how you might measure the duration of require operations to load dependencies; it uses the timerify function, which is part of the performance API.

The slide after this one assumes some knowledge of the Node.js event loop. While we won't go too deep into all that right now, I just wanted to show a high-level diagram of what it might look like. From a fresh start, new calls come into the poll phase and traverse through the event loop. If we had code that was synchronous, we might block other code from being executed. So an important addition to the perf_hooks module is event loop utilization, or ELU as we'll call it from this point, which was added in the middle of the Node 12 and 14 life cycles, as well as being part of Node 16. The eventLoopUtilization method returns an object that contains the cumulative duration of time the event loop has been both idle and active. ELU is similar to CPU utilization, except that it only measures event loop statistics and not CPU usage. It represents the percentage of time the event loop has spent outside the event loop's provider; no other CPU idle time is taken into consideration. The example on this slide shows how a mostly idle process can still have a high ELU: we synchronously spawn a sleep process, which blocks the event loop. So even though the CPU would be mostly idle, our event loop is blocked, which would slow down our application.
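That example is essentially the one from the Node.js perf_hooks docs; a sketch:

```js
'use strict';
const { eventLoopUtilization } = require('perf_hooks').performance;
const { spawnSync } = require('child_process');

setImmediate(() => {
  // Take a baseline reading once the event loop has started.
  const elu = eventLoopUtilization();

  // Synchronously spawn a sleep, blocking the event loop for 5 seconds
  // while the CPU itself sits mostly idle.
  spawnSync('sleep', ['5']);

  // Passing the baseline returns the delta since that reading;
  // utilization here will be close to 1 (100%).
  console.log(eventLoopUtilization(elu).utilization);
});
```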
So this is a great new addition for being able to determine where a problem might be. And of course, once you're ready to deploy your application to a container-based platform, it becomes a little more difficult to get access to those types of things. In this next section, we'll talk about getting access to some of the data that I just mentioned when running inside a Kubernetes environment.

One of the most production-ready solutions for monitoring containerized applications is the open-source Prometheus toolkit. Prometheus is a mature and battle-tested monitoring and alerting tool that provides multiple features: dimensional data that can be identified by metric name and key-value pairs; a very powerful and flexible query language, so you can merge different metrics together to produce more complex and meaningful ones; and multiple modes of graphing and dashboarding, although for these you'll want a third-party tool like Grafana, for example. And the nice thing about Prometheus is that you can find integration libraries for all sorts of languages and frameworks, which is really nice. Oh, and there is one more important detail: Prometheus uses a pull model. This means that you have to expose the collected metrics somehow, like through an API endpoint, so that Prometheus can scrape and collect the desired metrics. And we'll see a way to automate this in the coming slides.

Okay, so how do we expose some metrics to Prometheus from Node.js? We use a library called prom-client that offers a great deal of features. First of all, prom-client can provide, by default, many of the metrics that I mentioned before, like garbage collection metrics, event loop metrics, et cetera. We can, of course, create our own metrics that we want to measure. And one nice thing about prom-client is that it allows us to use a push gateway model if necessary. That means we can push data out directly instead of waiting for Prometheus to scrape it. Now, you may be wondering why this is useful. Well, think of a situation where we measure metrics from a batch or background job. Unfortunately, the job doesn't live long enough for Prometheus to scrape the data, but for some reason we really need to get those metrics into Prometheus. This is where the push gateway comes in handy, because the batch or background job can push its metrics right before it finishes, so eventually Prometheus will get the metrics that we want. In general, though, we use the push gateway method only in edge-case scenarios.

So here's a small code snippet of how you require the prom-client module inside your application. This code is basically saying, "hey, let me collect all the default metrics that you've got." We're also adding a prefix to our metrics, which will help later when we want to see metrics for just our application; if there were more than one application, it would also help determine which application's metrics you're looking at. And a few lines of code later, we expose those metrics at the metrics endpoint. This is a code snippet from the demo we'll see in a couple of minutes. And if we visit that metrics endpoint from the browser, this is what we'll get. Of course, there's more information being collected, but I'm only showing a small subset since the slides aren't that big. We can also see that each of the metrics has the prefix that we added in the code snippet on the previous slide. If you have more than one application, you can see how this prefix could be quite useful. So, as I mentioned above, with prom-client we can create custom metrics for our application if we want to.
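The demo's exact snippet isn't reproduced in this transcript, but a minimal sketch of both ideas, assuming Express (the prefix, metric name, and port here are illustrative, and the histogram is just one example of a custom metric), might look like this:

```js
const express = require('express');
const promClient = require('prom-client');

// Collect Node's default metrics (GC, event loop lag, heap usage, etc.)
// with a prefix so we can tell our application's metrics apart.
promClient.collectDefaultMetrics({ prefix: 'my_app_' });

// A custom metric: a histogram of HTTP request durations.
const httpDuration = new promClient.Histogram({
  name: 'my_app_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
});

const app = express();

// Time every request and record it with its route and status code.
app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.path, status_code: res.statusCode });
  });
  next();
});

// Expose everything the default registry has collected so Prometheus
// can scrape it.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

app.listen(8080);
```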
But what kind of metrics are really useful? Enter RED metrics. The RED method defines three key metrics (well, there are basically four, but the following three are the important ones) that you should measure for every microservice in your architecture. These metrics are part of the "four golden signals" defined by Google's Site Reliability Engineering practice; a couple of years back, Google published a book based on their experiences about which metrics helped them maintain their huge infrastructure. The metrics are: rate, the number of requests per second your services are serving; errors, the number of failed requests per second; and the distribution (yeah, it's a very tough word to say), the distribution of the amount of time each request takes, which would be the duration. By monitoring these three key metrics, you can deduce a lot about how your application performs in general.

All right, so let's take a quick look at the demo. We did a recording, just in case anything went wrong with the demo gods here. So we'll make that full screen, and we'll pause it real quick just to explain what's going on a little bit. The application is just a basic REST application with two endpoints. The first is the /api/greeting endpoint, which will either return a small message or a failure; the failure comes from a function that produces random failures, just for the purpose of the demo. The other endpoint is the metrics endpoint that we saw earlier, which Prometheus needs. Here I'm actually using OpenShift, which is a flavor of Kubernetes, because it comes with Prometheus already built in, so I won't have to manually deploy the Prometheus operator. The whole goal of this demo is to measure the average request duration and graph it over the last five minutes. So I'll unpause the video and let that play.

To deploy the application, we're using a CLI tool called NodeShift. What this does, basically, is take the code that you've written, without you needing to know what a Dockerfile or anything like that is, package it up, and push it up to OpenShift. Then, on OpenShift itself, it runs a Source-to-Image build, also known as an S2I build, which containerizes your application. Once the application is containerized, it's pushed into OpenShift's internal container registry, which is used to deploy the application. NodeShift also lets you avoid writing any YAML files and has some very sensible defaults. So really, all you have to do is worry about your code, and NodeShift will take care of the rest.
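The exact command isn't shown in this transcript, but a typical invocation, as a sketch (assuming you run it from the project directory), is something like:

```bash
# Package, build (S2I), and deploy the current project to OpenShift;
# --expose creates a route so the app is reachable from outside the cluster.
npx nodeshift --expose
```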
So now the application is deployed... it's almost deployed... it's still... okay, there we go, it's running. Now that it's deployed, we can get the URL, make sure it's correct, and go to the correct endpoint, which is that /api/greeting. Once we go there, it should show us some "Hello World" message, and if we refresh a couple of times, we'll see that we also get those error messages that we mentioned... there we go: random failure 1, random failure 2. And there's also the metrics endpoint that provides the metrics of our application, so we can scroll past all the default ones to the bottom, where we have our custom metrics that we want to provide to Prometheus, which are all those HTTP-related metrics that we've collected.

Clicking on the monitoring section, we can get some basic information, things like memory usage. We can reduce the time frame... there we go, memory usage... we can reduce the time frame to zoom out a little bit. I'll pause real quick here to clarify one important detail: while these are metrics, all the metrics we're seeing here are coming from OpenShift itself, since we haven't yet activated Prometheus for our application. To get Prometheus to start scraping our metrics, we'll need to deploy a ServiceMonitor YAML, so we'll do that in about a second once we go over there.

So here I'll have to deploy a ServiceMonitor. The ServiceMonitor is really just a ten-line YAML where we specify the metrics endpoint that Prometheus will scrape; there's a scrape interval in there and some other things, and after this demo we'll take a look at what that ServiceMonitor actually looks like. It might just take a couple of seconds on the OpenShift cluster that we're using... there we go. So let's put in the ServiceMonitor... yep, it takes a couple of seconds.

Now that the ServiceMonitor is deployed, we're going to use Apache Bench to stress-test the endpoint a little bit. We're going to do 5,000 requests, 100 concurrent requests at a time, and we have to make sure we direct them at that /api/greeting endpoint. This really shouldn't take too long; it's a pretty quick operation. All right, once that's done, we can go back to our metrics view and add a custom query. While this custom query does look a little weird, it's just a mean average calculation using the metrics that we've collected. We can zoom into a five-minute portion of the graph to see that there are two lines... there we go, there are two lines there: one for success, one for 500 errors. And that's because we're either getting back "Hello World" or we're getting back an error message. So we can pause our... that's the end of the demo, so we can pause it there and continue on.

There we go. So like I said, in order to have Prometheus activated and getting all the metrics from our application, we had to create this thing called a ServiceMonitor. And as we can see, we have our endpoint that says, "hey, every 30 seconds we're going to do a scrape," and we're going to match this particular project with this my-app label, which we added a lot earlier.
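The demo's exact YAML isn't captured in this transcript, but a ServiceMonitor along those lines might look something like this sketch (the names, labels, and port are illustrative, not the exact ones from the demo):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
spec:
  endpoints:
    - interval: 30s   # scrape every 30 seconds
      port: http
      path: /metrics  # the endpoint we exposed with prom-client
  selector:
    matchLabels:
      app: my-app     # matches the label we added to the app earlier
```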
Now, while this next thing isn't an observability tool built into Node at all, I did want to quickly mention an effort that Red Hat and IBM have started to define the types of things developers could be adding to their applications to make them the best they can be. For example, we're in the process of putting together an observability and metrics section. What this is is a reference architecture, and it's not meant to be the end-all, be-all of "this is how you should do things"; the opinions here come from Red Hat's and IBM's experiences and from our customer interactions. And the repo is open for all those who would like to contribute: if you go to NodeShift.dev, there should be links to the reference architecture, and there are a bunch of other sections that we're starting to work on where it would be great to have some feedback.

All right, so to wrap up, there are some resources you can check out, like Prometheus, prom-client, and the reference architecture at NodeShift.dev. There's also a great blog post written by one of our colleagues about monitoring Node applications on OpenShift, so I would definitely give that a Google and check it out. And thank you very much.