OK, well, this is already the best panel discussion I've ever participated in, from hieroglyphs to encyclopedias. It's been great. But we should get started. So, this is application monitoring. My name is Ben Sternthal. I'm a program director at OpenJS, and I work with Robin. I've been with the Foundation for a little over a month, and it's been really great. I'm really happy to meet these fine folks we're going to be paneling with. So I'll let Zoe and Steven introduce themselves.

Hi there. I'm Steven. I work at Datadog. I've been doing APM stuff for about a decade, building all sorts of diagnostics tools. It's fun.

My name is Zoe Steinkamp, and I'm a developer advocate at InfluxData, which is a time-series database company. I started there as a front-end software engineer on our React side, and now I work as a dev rel.

So we've got some questions we want to go through, but this is a pretty intimate space, so we're thinking we'll try to leave some time at the end for questions and answers from folks. That said, let's start off with some of the basics. Let's pretend, hypothetically, that I don't know anything about application monitoring, as hard as that is to believe. My first question is just: how would you define this, and why is it important? Why should I care?

Application monitoring is pretty much just about knowing what your system is doing and why. Every application is going to do something wrong at some point. You told it to do the wrong thing, you wrote a thing not quite right, some user is trying to hack you, or a billion other reasons why a thing cannot do what you think it's going to do. So it helps to have continuous monitoring, because someday there's going to be an edge case. You can have a route where you think, oh yeah, I'm just serving a static file or something.
Nothing could ever go wrong here, and then someone sends a weird header or something and crashes the entire server. If you aren't looking at it, you won't see why that happens.

Right. I would say we normally define application monitoring as building a better user experience. That's the big thing: making sure your end users are happy, which, yes, normally involves things not crashing. They can fiddle around as much as they want without having any issues. They can click all the buttons and they all go green. It's a great day. And why it's become more important, especially over the past couple of years: once upon a time, back when Wikipedia began, we were a little more forgiving of things not quite working. You'd be willing to wait a little while your website loaded, and maybe something broke every now and then and you'd tolerate it. Nowadays, you don't tolerate apps that don't open immediately, and if they open with weirdness going on, you just close them, and you might even delete them. There's no tolerance anymore for things not running smoothly. And especially in businesses nowadays, there's no tolerance at all. When GitHub goes down, it's a major problem. An entire company's developers are just sitting there like, I can't believe we relied upon this system. Damn them all. Even if it's only crashed once every five years or something, it's still aggravating on that day. So that's also why it's become so important, especially in the past, like I said, five to ten years.

I slightly miss the old under-construction pages. When things go bad, you can just put that up and be like, yeah, yeah, progress is happening for sure.
So for new engineers who maybe aren't working on something as important as GitHub, but still important enough to deserve monitoring, what advice do you have for them when they're starting to think about how to monitor their applications?

I would say one thing to keep in mind when you first get started is to think about how much of, I'm going to call it the triangle, you have: your time, money, and developer energy. Sometimes at the beginning you actually have quite a lot of time and developer energy, because you're working straight on the project and you have the time to do it, in which case you can look at tooling that's a bit more open source, a little more custom, a little more work. And at the beginning stages I hear a lot of people say, I don't want to pay for this, I don't even have any customers, or I have very few customers. So that's a great solution for the place your company's at. But if you're a bit bigger, and I will say this, most bigger companies are monitoring quite a few things. They don't get big without monitoring, I can assure you of that, because otherwise they crash. At that point it becomes a lot more about what solutions you can find that meet your needs immediately. Your developers do not have time for this, and they want solutions that work right out of the box. So that's definitely something to keep in mind: how much of this you can afford to put towards it. But do also keep in mind, I would rather have a monitoring system held together with duct tape and hope than absolutely nothing, because nothing is the worst case.

Yeah, the application performance monitoring space is a giant, wide space with a huge variety of products, and that can be intimidating to new users.
There's a lot of focus on particular types of things. Tracing: every company forever has tried to sell tracing as the core thing. But actually, tracing is what the bigger companies that really understand what's going on want. For most companies, profiling is more what you want: just show me statistically what the thing is doing most of the time. I'm going to look at the top couple of things and make those a little faster, and that's the extent of the work most companies are going to do. A lot of the tools in APM are there for the more advanced organizations that are going to dive deep into everything. You kind of need a starting point, and then you start exploring the other tools and get there eventually.

So what do you see as some of the common mistakes people make early on when they're implementing these systems?

Well, if I had to pick a top three. My top one, for sure, is that literally every company everywhere wants to have multiple APMs running at the same time. And currently, at least in Node, they all just monkey-patch everything, and they don't tend to play nice with each other, so everything breaks. I'm working on that; it's a long, long effort to fix. But yeah, there are instability issues. And, like I said before, a lot of users don't understand most of the products. Because a lot of companies pitch tracing as the core thing, they look straight to that, and then, depending on who the specific person is who actually looks at the UI, it may or may not make sense to them. Different organizations will have different people deploying this and looking at it. In some companies, the dev team will be responsible for setting up the APM, so they'll have the context of, oh yeah, what routes does the service have? I can look at the trace for this particular route, that sort of thing.
But in other cases, you might have some DevOps person set this up who doesn't actually know what the service does. They just want to know: is this thing still running? Are the requests that go in coming out the other end? They want a more high-level view of it, but then it gets presented to them in the wrong way for what they want. So, in my opinion, you always need to tailor it for each user and have a discussion with them to figure out which part of the product they actually want, and show them how to use that.

Right. It's the same problem of having to pick. First I want to tack on to what you were saying. For example, your security team probably cares less about traces and a lot more about logs. They want to know if they're being attacked, and all of a sudden the sign-in logs are going crazy because you're being hit by a phishing scam or something like that. So one big mistake I see bigger organizations especially make is not involving all the right people, all the right teams who are going to need to look at this data. And that leads to the two other big things, which are connected in a weird way. One is monitoring too little: you're picking the slimmest set of metrics you can get, so you're not getting a good, holistic view of how everything's working together. But at the same time, monitoring too many metrics can also be an issue. Outside of cases where there's already a predefined space for what you should be monitoring, that's just something you have to deal with as a developer or a DevOps person: testing the waters and figuring out what works for you and your company. Yeah.
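Steven's earlier point about Node agents monkey-patching everything, and not playing nice with each other, can be sketched roughly like this. This is an illustrative toy, not any vendor's actual agent code; `patchOnce` and the `__apmPatched` marker are invented names. The guard shows where conflicts come from: an agent that doesn't check ends up wrapping another agent's wrapper.

```javascript
// Toy sketch of APM-style monkey-patching in Node: replace a method with a
// wrapper that records timing, and mark it so a second agent can detect it.
function patchOnce(obj, method, wrap) {
  const original = obj[method];
  if (original.__apmPatched) return false; // another agent got here first
  const patched = wrap(original);
  patched.__apmPatched = true;
  obj[method] = patched;
  return true;
}

// A toy "service" method to instrument.
const timingsNs = [];
const service = {
  sum(n) {
    let total = 0;
    for (let i = 1; i <= n; i++) total += i;
    return total;
  },
};

patchOnce(service, 'sum', (original) =>
  function (...args) {
    const start = process.hrtime.bigint();
    try {
      return original.apply(this, args); // behavior is unchanged...
    } finally {
      timingsNs.push(Number(process.hrtime.bigint() - start)); // ...but timed
    }
  }
);

const result = service.sum(10);                          // 55, and one timing recorded
const secondAgent = patchOnce(service, 'sum', (f) => f); // false: refuses to double-patch
```

Real agents wrap far more than one method (http, database drivers, async context propagation), which is why two of them layered over the same internals tend to break things.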
Yeah, I also find there's a bit of a noise problem with some of the providers. Depending on how you configure it, you're going to get different amounts of data, and you can configure APMs to be quite a firehose in some cases. The users have to learn a lot of different things if they just go and turn everything on. Security is definitely an interesting one; there's always new security monitoring being added. We need to look at this kind of case: maybe SQL injections, we're going to look for that; we need to look for prototype pollution; all sorts of different things, even language-specific ones. We need to look at this particular pattern, and yeah, we can capture that and make metrics or alerts for it or whatever. But we also have to talk to the users and explain: yes, this is a thing we're going to warn you about a bunch, but you still need to know what the thing is we're telling you about.

So, getting language-specific: for JavaScript applications, what do you think are the most important things folks should pay attention to, out of the wide number of metrics out there?

I would say, speaking as somebody who worked not so much as a back-end Node.js engineer but, like I said, mainly on the front end in React, we dealt a lot with performance metrics around the things I mentioned before: user experience, latency, making sure websites loaded properly. Another big one that we faced as a company, and I've heard this from other front-end engineers as well, is third-party apps being integrated in. They can occasionally go haywire, and the next thing you know they're slowing down your website like crazy, or they're possibly a security issue as well. But mainly we were monitoring to make sure that our end users were having a good experience.
Things seemed to be loading at what we considered an appropriate time. Things came in right. We also did a lot of monitoring in Google Analytics, especially back in the day, with all of our dashboards, and we would check how often buttons were being pressed, because we had rough ideas of what we expected our users to do. So we could actually tell when things were broken, because we were getting metrics we were not expecting. All of a sudden, a button that's pressed 10,000 times a day is pressed zero times, or rather that HTTP request is no longer going out, and you're like, OK, it was probably missing from the page. It was a weird CSS bug that came up, and the button was no longer working. It's kind of obvious in the metrics when it goes from 10,000 to zero that something's not quite right.

Yep, that connects to my experience. Most APM customers' biggest issue is that they're oftentimes big orgs with probably thousands of different services deployed, and they oftentimes don't even know what all of them are or what versions they are. Most users are going to be on super old versions of things in some corners and modern versions in other corners, and they don't realize, oh yeah, we're actually using this thing that went end-of-life seven years ago. Or they're using vulnerable versions of libraries and not even realizing it. Oftentimes these companies won't be aware of this, and we'll make them aware of it basically by pointing out: oh yeah, you have this vulnerable version of something over there. And then they go and update the thing. But that sort of thing can turn into, oh yeah, I updated the thing, but I didn't do it right.
And now a thing disappeared. And having the observability there lets you see: oh yeah, I thought I was just making this simple change, but actually there's a compatibility issue between the super old thing and the new thing that I didn't even consider.

That's a good segue to the next question, which is about how this type of monitoring can help with security and regulatory compliance. Security has obviously always been very important, but even more so now, and regulatory compliance is becoming even more of a thing people need to think about. So where does application monitoring fit in?

I would say, and we've seen this with our customers, and in general whenever I read about security, it's become really important. Certain things like HIPAA practically require you to do a certain level of monitoring. And that's actually the case with most security standards: they expect you to be monitoring for certain types of attacks or certain types of security vulnerabilities. So if you're not monitoring at what they consider an adequate level, half the time you are no longer in compliance, basically. It's expected that you will do a baseline minimum of monitoring. And one really great example of how this is used in the real world, in a way we all interact with, is credit cards. Credit card companies are required to do a lot of monitoring. So if you've ever had your credit card stolen physically, or someone wrote down your numbers and went shopping, and it's happened to me, they caught it on the first purchase. Because they are required to do this kind of monitoring. They're required to keep track of me as a person, in both a good way and a creepy way.
But basically, they're using those monitoring tools to make sure that your data is safe and that you are, in general, being treated well as a customer, and not having your money stolen while somebody goes on a shopping spree. That's a very common use case.

Yep, there are lots of different things. Most log monitoring products will look for certain patterns; credit card numbers get logged in logs all the time, and yeah, that's a big problem. A lot of those regulatory requirements, for health things or credit cards, or if you're doing anything with the government, come with really strict security requirements. For most of them, you have to know where every single bit of traffic is coming from. For every connection to your servers, you have to have a log of where it connected from and all the information about it. Oftentimes you have to keep a log of HTTP traffic: this IP address was responsible for these specific requests. And you have to be able to provide reports, so that if you get attacked, you can do a post-mortem later and look up: OK, the attacks came from this IP address, so let's go through what they requested so we know what they got access to. And oftentimes, outside of the JavaScript world, you need network monitoring and all sorts of other things to know what traffic is going on. If someone gets into a container, there's often monitoring for whether they got out of the container into the host, all sorts of things like that. When you're dealing with the regulatory stuff, it gets very complicated.

Yeah. So I think it's time to look into the future. Gaze into your crystal ball and try to see what's going to develop in this space. Near future and far future, right?
This feels like an area where AI, I mean, AI is going to have tendrils into everything, but I'm curious to hear your thoughts on what's going to happen in this space, near future and far future.

I think in the near future you're going to see a lot more about security, actually, quite a bit more, because as we've all been watching the news over the past two years especially, you just see a lot more attacks, and companies are getting a lot more cautious, and end users as well. End users are becoming a lot more cognizant of, I don't want to give my information to a company that's going to go lose it and get hacked. The other thing, when it comes to AI, I have to admit I'm not quite sure where it's going, but I do know it's probably going to start asking questions that we have never asked. That's probably going to be the big thing: it's going to start asking monitoring questions we haven't even thought to ask yet, and it will probably also be able to do things like predictive maintenance, predicting problems to make sure there's no downtime. Another big thing I'm noticing is that we keep trending towards microservices, which is a good thing, not a bad thing, but it does make application monitoring a little trickier, and I've heard that in Node.js serverless environments it's even trickier. Apparently there are some missing pieces that could be improved to make that easier, so I think another big thing we'll see going forward is more observability tools built around that architecture.

Yeah, I see a big part of the future being a more dynamic nature to how we observe things. There are a lot of interesting insights you can get, but they're too expensive to gather continuously.

Can you give an example?

Uh, what's a good example?
If you wanted to capture, say, every single promise in the Node runtime, there are a lot of them. If you wanted the stack trace of where every single one of those promises came from, you're just going to take down the service. But you can look at patterns in the application behavior to see, oh yeah, there's a CPU spike or something, often when it calls this thing, so I'm going to turn on this extra-heavy thing just in this particular case, just to look at that. I think there's going to be more of that. And one interesting thing that's come up is eBPF, which is external tracing: you're listening to kernel events and things like that, but you can turn it on at runtime, gather a bit of data, and turn it off again, at any time. I think we're going to see a lot more generic observability things. Less of a tool focused on a particular language, and more generic things that are starting to understand the patterns of runtimes. Like Node: there's V8 in there, and it's generating a bunch of native code. That's really hard to look at now, because no one has really programmed anything to understand how the JIT in V8 is laid out, how to look at that memory and symbolize things to know the name of the function at a given location.
That currently doesn't really exist, but I think external tools are going to start to get that information, and we're going to have powerful tools that just look at everything on the system and see: oh yeah, this is some random C thing, but I can infer from eBPF what its rough structure is, and identify: oh, this is a Redis server, I'm going to look at it this way; this is a Node server, I'm going to look at it that way; this is something else. And it's going to funnel all of this from the whole system to give you more of a system-level observability thing rather than just application-level.

OK, so I think we're going to do one last question before we open it up to Q&A, and that would be final thoughts or pieces of advice for folks.

My final thought slash piece of advice is that, I know even from this panel, application monitoring can sound like a bit of a headache, or just a lot to take in, a lot to go and research and look into, but it's still extremely valuable. I always like to think it's kind of like brushing your teeth every day. It may not be your favorite ritual to do at night, for some of us, but you are sure as heck going to regret it if you don't. This is the exact same thing. It might be a little painful at first, and just like brushing your teeth, you do occasionally have to go and check your monitoring tools. It's not a completely passive process; it's a continuous process you come back to. But it will still pay off a lot more in the end, and it's important to get it somewhat right in the beginning. As I told one person: once you have the customers and your stuff starts breaking, how long do you think they're going to stay your customers? Because again, there's no room for error really anymore, even if you're a small company.
Okay, so, questions from the audience for our experts.

So it didn't come up here, but obviously getting the instrumentation and metrics and traces and all of that is important, but the next most important thing is probably informing your developers with something like alerts. I'm wondering how you all think about that from an APM perspective and the work you've done, and whether there are any particular strategies you'd recommend for approaching alerting on top of the monitoring.

I think that's definitely an area where AI is going to start to shine a lot. APM is kind of a data firehose, and there are a lot of things that could possibly be an anomaly in some way or another that you'd care about. It's very difficult to just define: if this particular thing happens, then tell me about it, because there could be a thousand possible things. In a lot of systems now, that's the way it works: you set up, oh yeah, if CPU goes over X, then send me an email. That works okay, but we need to do better about understanding what is actually going on, to see: this was normal traffic behavior and this is not, and it's different in these ways. Maybe that's important and we need to go look at why it's different, or maybe it's just a high-traffic day, who knows, but you probably want to know about it either way.

Yeah, I would say that at my company currently we have a lot of integrations with things like Slack and PagerDuty, and really, where that leaves a gap, to an extent, well, it's not even really a gap. It's more that it's a user error; that's a better way to put it.
What that leads to is the fact that the user has to know that the DevOps guy needs to be called when the CPU goes over the threshold, or that it needs to be somebody else. What you can do, and what my company actually does, is have an incidents channel, which pretty much all of our engineering department, including myself, lives in. When there's an incident, it just goes to the channel, and then multiple on-call people can come and figure it out amongst themselves, to an extent, who's who. But that can get complicated if you're dealing with a larger company; at a larger company I would not suggest that as the solution. It works at a company our size, but definitely not bigger. And some of our customers deal with real IoT devices in the field, and they actually need to text people in the field. Someone working in an indoor farming facility is not somebody who's on Slack; that's somebody who needs to receive a text message when the pressure monitor on the plant is doing weird things. So it's very much a you-need-to-know-who-to-contact kind of problem. Yes, you need to know what's going wrong, but then, who the heck are we contacting about this problem, and do we need to contact multiple people? So definitely, when you do application monitoring, keep in mind that you need alerting systems. You need to send a text, a Slack, an email if you're feeling risky. You've got to pick something and start contacting people, because when things go bad, if it's just screaming into the abyss, it doesn't help anybody.

That'd be a great t-shirt for a company. One thing to share: at Netflix we have a single channel, and it does actually work pretty well, surprisingly. So anyway.

Hey, if it works, don't break it.

Okay, we have two. I'm going to go to the guy in the back real quick, and we have three.

Thank you. So, to follow on from "brush your teeth, or else":
Do you see these tools evolving so that we have better impact prediction? Because I know some tools show, okay, X amount of users affected, but it's tucked away in a corner somewhere and really hard to see. I mean more on the messaging side: okay, if you don't respond to this within the next 30 minutes, your impact is going to grow. Do you see tools evolving into this, doing a better job of predicting what it's going to cost to not act on time?

So, at least for me personally, my work experience doesn't revolve quite as much around impact per se, but I could see that becoming a lot more common with AI integration: making a prediction based on how many users are normally there and how long it estimates something will take to fix. Maybe in the future, AI will also be a little smart about figuring out where the bug might have come from. Maybe it will go and check commit logs and be like, this suspiciously happened one minute after this commit went up; I have my suspicions. I do think that will possibly become more of a thing. It could even do the updating: have you ever seen, when a site's down, they have a down page where they're like, hey, we're working on it, here's the estimate of when it will be fixed? I wouldn't be shocked if, kind of like the under-construction page you mentioned earlier, it could intelligently put up a warning message that says, hey, we know there's an issue here, we're working on it. I have actually seen that a few times, but I'm almost certain that wasn't AI; I'm almost certain that was a real person who had to push it up and be like, all right, put up the broken message, we've got to let them know.

Yeah, a lot of products are starting to get into the cost analytics thing, which I think kind of connects to that.
Most of them are more or less just telling you your AWS bill all over again, but some of them are starting to capture: oh yeah, these things are responsible for this branch of your traffic. Say you have a shopping cart system or something: the checkout thing will show, oh yeah, this percentage of the traffic I can attribute to these users came from the shopping cart. And you can connect that to metrics and things like that to capture: these users have spent this amount of money going through this part of the system, so we can try to work out that if this thing goes down, it's going to cost us this amount of money.

Hi. So, AI comes into the picture for performance management, and there is a lot of data that we would be looking at across different stacks. With the evolution of AI, there will always be a trade-off between cost and memory, so how do we plan on addressing that in the future?
With most performance monitoring products, it's pretty much always just about sampling. We're not going to look at everything ever; you just have to be intelligent about how you're sampling things. Most APMs will let you have sample rate rules per route or things like that, so you can say: this other thing doesn't happen very often, but I still want enough information about it, so even if it doesn't fit the overall sampling rate, give it a sample rate of 100% every time it gets hit, because that thing is important. Just mess with the configurations, basically, to say: I really need to know about these specific things, so turn up the volume on those, I guess.

Okay, thank you. I think earlier in the panel you brought up that some people actually leverage multiple APMs all at once. Like, how? Do people?

Badly.

And what is there to gain, if anything, from all these different, I don't know, in the Node.js context, how much different information would Datadog give me versus someone else, if you're all just hooking into async hooks no matter what?
So, there are lots of different reasons I've seen companies have more than one APM. Sometimes different APMs have slightly different features: one is more error-reporting oriented and the other is more tracing oriented. You also often get, most APMs are kind of expensive, Datadog included, so customers will sometimes have multiples installed to have a bit of leverage: oh, we might switch to these other guys, because your bill's a bit much, that sort of thing. And sometimes the decision to install an APM doesn't come from that team; it comes from someone higher up in the organization. You can even get this director over here deciding we're going to use this product, and that director over there deciding we're going to use that product, and sometimes two different directors have each said, you have to use this, for two different products, and the team's just like, fine, whatever, that's what you told me to do, so I'll do it, I guess.

Thank you. Perfect, I'll take us home. So, something that was mentioned earlier that I found super interesting, that really resonated with me, is this idea of trying to better understand the system-level behavior. I think once you get to a certain size, or say you have a good culture of instrumentation, really the next common problem you see is that a team gets paged at 3 a.m.
and then the adjacent team gets paged, and after that four other teams get paged, and you have five people sitting around trying to figure out: something just happened, something's going on, but what's actually going on? So you need to be able to draw relationships, or infer causes, between a lot of different things. How do you all see the APM ecosystem, if you will, evolving to meet that need?

Yep. Historically, we called ourselves application performance monitoring for a long time, and then a lot of the industry has tried to shift the image to observability. Part of the reasoning for that is that your application is not just your code; it is your code living on a system that has a bunch of stuff running on it. Even if you try to isolate things, this is a Docker container with just this in it, and you only run this single Docker container on this hardware, there can still be weird stuff in there. Unless you strip things down to a microkernel system or something like that, you can have Linux services misbehaving randomly sometimes, all sorts of different things that can go wrong. Usually it's fine, but occasionally things at that level will influence the behavior of your actual application code. So having observability of the whole thing, and not just the little slice of memory that this process happens to own, can be valuable. It kind of depends on the depth of what you do and your ability to actually do anything about it. If something is, oh yeah, there's a bug in some version of Linux, a lot of people's fix is going to be: let's just update the thing, or roll back. But
some users who can do more might say, oh yeah, we're going to make a kernel patch or something to fix this right now. So yeah, there's value there, but it depends on what your scale is.

Right. I think that's it for today. Awesome. Thank you so much. Thanks to the panel. Thank you.
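As a closing footnote on Steven's Q&A answer about sampling: the "per-route sample rate rules" he describes can be sketched in a few lines. The route names and rates here are invented for illustration, not any product's actual configuration format.

```javascript
// Minimal per-route sampler: a global default rate, with per-route
// overrides so rare-but-important routes are always kept and noisy
// routes are dropped entirely.
function makeSampler(defaultRate, overrides = {}) {
  // rand is injectable so the decision is testable; callers normally omit it.
  return function shouldSample(route, rand = Math.random()) {
    const rate = overrides[route] ?? defaultRate;
    return rand < rate;
  };
}

const shouldSample = makeSampler(0.1, {
  '/admin/billing': 1.0, // rare route: "turn up the volume", keep every hit
  '/healthcheck': 0.0,   // noisy route: never keep
});
```

A real agent would layer rate limiting and tail-based rules on top, but the core decision per request is this small.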