Cool. Well, welcome everybody. It's my first time at a DrupalCon, and so far it's been a blast. Today's session is Monitoring 101: Finding Signal in the Noise, and I'm Ilan Rabinovitch with Datadog. A quick background on myself as we're getting started: I've been with Datadog almost a year now, and before that I was a customer for a number of years. My interests tend to focus on evangelizing and advocating for open source in various forms; I've been involved with planning SCALE, Texas Linux Fest, and other events like that, and Ron in the front row there is one of my co-organizers. I tend to work in large-scale web operations, at places like Edmunds.com and Ooyala, building automation tooling and trying, sometimes succeeding, sometimes failing, to build my own monitoring systems, eventually landing at Datadog. So I'm really big on monitoring and metrics and collecting data about our environments.

How many folks are familiar with Datadog and what we do? Heard of it before? Okay, a couple of hands. We're a SaaS-based infrastructure and application monitoring platform. That's a lot of words, so basically: we help you monitor your applications, your infrastructure, and your services, make sure they're online, and collect metrics about both their performance and their availability. It's based on an open source agent that runs on your web servers, app servers, or other components in your infrastructure, collects metrics locally, and sends them back to us as time series data: metrics or events, and we'll get into what those are in a second. At this point we're processing about a trillion data points per day, so as we start to talk about how many metrics are in your environment, you can imagine multiplying that out across our thousands of customers. We do things like intelligent alerting and dashboarding to help you make sense of your data. I'll also mention that we're hiring; if you're interested in talking about any of this, I'm happy to chat after the session.

When I say monitor everything, I mean everything from your cloud providers to your containers to your operating systems, even other monitoring systems, which is why you'll see things like New Relic and Nagios up there. We believe your data is most useful when it's all together in one place, so our focus is helping you bring it together. The goal is to help you identify and address critical infrastructure and business problems, like this one: there's clearly a load-balancing problem somewhere with this bull, and we want to help you identify these things before your customers notice.

The quick plan for the next 45 minutes or so: we'll do a quick introduction and background on what DevOps is, or at least what I think it is; we all have differing opinions and definitions for it, but I think it frames why metrics are important. We'll talk about the challenge of what dynamic infrastructure looks like today, as you bring in things like Docker, Amazon, Google Cloud, and all the other providers you interact with. We'll talk about some frameworks we've come up with, in conversation with industry experts and our customers, for how to think about your data. And then we'll walk through some examples from real applications.
So the first thing, of course, is to define our focus area. We're here in the DevOps track, so what does DevOps mean? Two very intelligent men, Damon Edwards and John Willis, were at DevOpsDays LA many years back, and they coined the acronym CAMS as the four pillars of DevOps: Culture, Automation, Metrics, and Sharing. Let's talk a little about what we mean by that.

Culture: if you have a top-down organization, or your teams are not collaborating and working together, it doesn't matter how many tools I give you, and it doesn't matter how many tools you build yourself; they're not going to get you anywhere. As Conway said, the systems we build reflect the structure of our organization. That also means they reflect the dysfunctions of our organization, if they exist. Think about something like this barn raising. There's a ton of technology there, but not a ton of modern technology, and it still works hundreds of years after it started, and they're quite efficient at it. To some extent it has to do with the tooling, but really, this is not modern automation. What it comes down to is the fact that they're working together collaboratively. Now, I know some people say DevOps is this thing where everybody does everything and we're all generalists. That's not necessarily true; to each organization their own. The idea is that we're all collaborating and working together, and in this case you've got folks with very specific roles: the guys doing the joisting are the only ones who get to do that. You don't just wake up one day and decide to take on that role, right? So there's room for all flavors of this in your organization. And at the end of the day, there's no tool you can buy that gives you DevOps, or gives you the DevOps title. Tools are intended to support your culture. Once you have the culture in place, sure, go buy yourself some modern tooling: get some cranes, get some tractors, get some forklifts, build yourself a really fancy barn, or put some containers on a ship somewhere, and do it fast.

The next category, sorry, is automation. That's what I was referring to with the tooling: the Chefs, Puppets, and Ansibles of the world, all the tools we use today to bring software practices into our operations. Again, these exist to support your culture and to accelerate the work you're trying to do as a team.

So we've got culture and automation. Metrics: we can't really plan our course if we can't see where we are right now and where we'd like to be. That's where metrics come in. You wouldn't drive down the highway with your wipers off in the rain, or without your headlights in the dark, and it's a similar idea. You really don't want to be this guy when you're doing your postmortems, right? Do folks do postmortems in their organizations? I'm seeing nods and some hands. Cool. Postmortems are a great way to learn from your mistakes.
But as you're doing these incident reviews, if you weren't collecting the data to begin with, you're in a pretty tough situation. I've been in at least one or two of these where the only action item I could come up with was "we'll monitor it better next time." So it doesn't happen a third time; if it happens a second time, at least we'll know why. You want to avoid that, because you're going to crash, you're going to have issues. None of us builds perfect software, and none of us is immune from making mistakes, but at least we'll be able to catch it if we have metrics.

Now, metrics doesn't just mean monitoring your infrastructure. It also means monitoring your team. That might be the velocity of your sprints, or how fast your engineers are producing code. This is an example of a dashboard we make available to our customers that shows incidents: what's the noisiest thing waking you up at night, or distracting your team, at a given point? In a previous life, I used to go into engineering teams and consult with them about why they were getting 10,000 pages a week. I'd ask, how can you possibly make sense of all that noise and make decisions on it? You're not actually responding to all of these, are you? And they'd say, oh, of course we are. So we did the math, and it turned out that if they were, they probably weren't sleeping, because that's about a page a minute, every minute of that week. So you want to measure all of these things and use them to track toward improvement.

And then sharing, the last pillar. All of this is great in isolation, but if we're not sharing it across our organization, whether between our development and operations teams or between each of those teams and the business side, we're not going to be as successful. This loops back on culture: making sure we get the best of all the work we're doing together as a team, looking to learn together, and treating the problem as the enemy rather than the other teams in the organization. If there's a memory leak, the enemy isn't the developers. You don't walk up and blame the developer for the fact that the application crashed from a memory leak; you blame the memory leak, and you get together and collaborate on how to solve it, right?

A good example of a story I like to tell around this: how many folks know what this is up on the screen? This is the Mars Climate Orbiter. Sharing isn't just about sharing the data, but also about making sure we're using the same language and talking together in ways each of us understands. In this story, you had NASA engineers and Lockheed Martin engineers using different systems of measurement, one team using US customary units and the other using metric. It turns out that when you don't convert your units, things crash into planets and waste millions, if not billions, of dollars. So it's a problem. But today we're here to talk about monitoring, so we'll focus on the metrics and the sharing. I like to say that what we bring you is a set of tools that you can pour your culture and automation into and get those metrics and that sharing out.
So, collecting. We talked before about the challenge of postmortems where you don't have the data. Remember that collecting data is cheap when you have it; it's pretty damn expensive to try to recreate it later down the line. No matter how much you might look at a metric and think, well, I only look at that 20% of the time, I can probably cut it: that 20% of the time, it's probably the most important metric you'll ever be able to find, and the rest won't matter. Storage is cheap these days. Take advantage of it, and collect all the data you can while you have it. So we say: instrument all the things.

But what does that really mean? We're in this world of the cloud, and we think it's amazing, right? We have managed databases, we've got auto scaling. I remember days when somebody would say, "I want to do a product launch," and I'd say, great, give me a month to go get some servers from Dell or HP, and my team will run to the data center and put this stuff in the racks, and maybe we'll make the deadline for you. I don't have to do that anymore. All I have to ask is: do you have a credit card, and what's the limit on it? And we'll auto scale to whatever you'd like. Things configure themselves magically most of the time with tools like Chef or Puppet, we've got orchestration tooling like Kubernetes, and we have effectively infinite storage with things like S3 and EBS. Yes, there are equivalents from other cloud providers, and even within our own environments we have things like OpenStack that let us do this. So this is definitely driving a sense of constant change.

Back in October we did a study on Docker adoption. Are folks using Docker in the Drupal community much? What we found was that it's a technology that's spreading like wildfire. We saw 5x adoption in one year: it went from roughly 0% of the nodes we monitor to about 6% being in some way Docker-based. It adds up pretty quickly, and it further contributes to that constant change. Between all the moving pieces underneath you, it can sometimes feel like you're standing on quicksand, so we've got to figure out some ways to think about this.

What we started to do was think about our stack in a couple of different layers, and think about where you target your monitoring. On the left-hand side here is what your stack probably looked like 5 or 10 years ago: one application, maybe some off-the-shelf components, sitting on a single server in a rack somewhere. Good to go. Over time we started doing virtualization, your KVMs and your VMwares and all the other tools we have to make it faster to spin things up, and we thought, cool, we'll get more utilization out of these hosts. We'll put two VMs on that same box, or ten, split the resources however we want, and maybe not have everything co-located.
And then finally, today, we're in this container world, where you're running just enough OS to run that one web app or that one process. That's on the right here. The goal is to take something like this, a little bespoke each time, maybe not using up all the space in the back of that truck, and get to something like this, where it's all uniform: any crane can pick that container up and drop it on any truck, and there's not a lot of confusion about how you're doing these deployments. But it also results in a lot of movement in your environment. Remember that operational complexity increases with the number of things you have to measure and with the velocity of change.

So what is all of this cloud and container stuff doing to us? Assuming for a second that you're in Amazon (your other providers will have similar numbers), CloudWatch is going to give you about 10 metrics per instance. Your operating system probably has about a hundred that you actually care about and collect, and you'll probably have about 50 metrics from a given application, maybe Apache or Nginx or something like that. But we just talked about how popular containers are, and one of the things we found is that people are running upwards of four containers per virtual machine or host. So now we multiply that out further: roughly 150 metrics times N containers, plus the metrics for the underlying host. And you get some interesting math: our hundred instances became 400 containers, and the 160 metrics we had per host became somewhere around 640. Multiply that out and it's about 64,000 metrics. That's a lot to collect, especially if you're collecting at one-second, or at least sub-minute, granularity, which I hope you are, because things are changing that fast in this world. So it's a bit of a metrics overload.

How do we think about all this? How do we know what to deal with? That was the first part: the number of things we have to measure. Now how about the things that are changing? As we've mentioned, we're in this containerized and virtualized world, and what we're finding is that an average host sticks around for about 12 days. You're no longer naming hosts after your favorite cartoon characters or planets in the solar system; these hosts probably carry some unique ID that changes all the time, and you can't remember which host was which. And containers live even shorter lives: about three days is the median we're seeing. So we're cycling through our infrastructure quite quickly, and things that used to change every couple of hours are now changing in minutes. There's a lot of change underneath our feet, and it gets confusing as we try to figure out what to pay attention to. We need a way to think about this data, capture the signal, process it, and make some useful decisions based on it. We need something modern.
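To make that back-of-the-envelope arithmetic concrete, here's a quick sketch. The per-layer counts are the rough, illustrative figures from above, not exact numbers:

```python
# Back-of-the-envelope metric counts, using the rough figures from the talk.
CLOUD_METRICS = 10   # e.g. CloudWatch metrics per instance
OS_METRICS = 100     # operating system metrics you actually collect
APP_METRICS = 50     # metrics from a typical application (Nginx, Apache, ...)

per_instance = CLOUD_METRICS + OS_METRICS + APP_METRICS  # ~160

instances = 100
containers_per_instance = 4  # the "upwards of four containers per host" figure

# Simplification from the talk: treat each container as roughly another
# instance-equivalent worth of metrics riding on the same host.
per_host = per_instance * containers_per_instance  # ~640
total = instances * per_host                       # ~64,000

print(f"{instances} instances -> {instances * containers_per_instance} containers")
print(f"~{per_host} metrics per host, ~{total:,} metrics across the fleet")
```

And that's before you multiply by the samples per minute you're collecting at sub-minute granularity.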
So we came up with a guide that we call Monitoring 101. This is going to be the TL;DR edition, but the longer version is up on our website, and I encourage you all to read it. It applies regardless of what monitoring tool you happen to use; it's just a way to think about your environment. What we encourage you to do is break your metrics down into three categories: work metrics, resource metrics, and events.

Let's use an analogy. Say you're working in a factory producing some object; we'll go with cars. I unfortunately share nearly the same name as Elon Musk, just spelled differently, so people always ask me about my Teslas. Let's say we're manufacturing Teslas. What's our work metric? The work metric is the cars popping off the end of that assembly line. Throughput is how many of those cars we produce a day. Success versus error is how many complete cars come off the line that we can actually sell, versus how many lemons come off missing hubcaps, or with a cracked windshield, or some other defect that's going to result in an unhappy customer. These are the things people are actually paying us to do. Performance is how efficient we are: how quickly we can turn around an individual unit. Resources are all the pieces that go into making that product. Utilization is how much more capacity we have in our environment to do more of this thing; error rates are a similar concept; and availability is whether more of these resources are available to do more work. And finally, events: these are the things we do that change our environment, and they provide context on why we're using more resources or producing less work output. Maybe I got on stage and promised a car that doesn't actually exist yet, and now I'm telling you all to build it really quickly, so it's time to turn up the rate on that assembly line. In your applications, an event is probably something like a code deployment.

So let's apply this to something we all know here: web servers. At a conference like this we're probably talking about one of two, Nginx or Apache. I'll use Nginx for a second, but these metrics apply to pretty much any web application you interact with. Work metrics, again, are about throughput: things like requests per second. How many of those API calls are you returning? How many pages is Drupal rendering for you right now? Performance is request time: how fast can we turn around one of those requests while we're under load from our customers? And then error rates: your 200s versus your 500s and 400s. Are we returning the API responses or the web pages our customers want to see?
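As a minimal sketch of pulling one of those work metrics out of Nginx, here's what sampling throughput from the stub_status module might look like. The URL is an assumption (stub_status has to be enabled in your config), and note that stub_status only exposes connection and request counters; the 200s-versus-500s breakdown would come from your access logs or a fuller integration:

```python
# Minimal sketch: sample an Nginx work metric (throughput) via stub_status.
# The URL is an assumption; stub_status must be enabled in your Nginx config.
import time
import urllib.request

STATUS_URL = "http://localhost/nginx_status"  # hypothetical status endpoint

def total_requests() -> int:
    body = urllib.request.urlopen(STATUS_URL).read().decode()
    # The third line of stub_status output is: " <accepts> <handled> <requests>"
    accepts, handled, requests = (int(n) for n in body.splitlines()[2].split())
    return requests

# Requests per second: sample the ever-increasing counter twice and difference.
before = total_requests()
time.sleep(10)
after = total_requests()
print(f"~{(after - before) / 10:.1f} requests/sec")
```

An agent does essentially this on a loop, turning counters into rates and shipping them off as time series.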
On the resource side, it's all the things you're probably already monitoring on your systems today; this is where a lot of people have focused their efforts over the years. Is my disk full? Am I hitting the limits on IO or network bandwidth? What does the queue look like on the connection pool to my database? These are all interesting, but I would ask you: how many times has your boss called you in the middle of the night to say, "The website's working really well, but the CPU usage is kind of high"? How many times has a customer come to you wanting a credit because the CPU usage was high on your API servers? They don't know, and they don't care. They just want to know that they can make those calls, that the page loads, that they can get their job done.

Events, again, provide context. We're talking about things like configuration changes, whether made by you or by your config management system. Code deployments: maybe you put a new release out on the website, the button that lets you register for DrupalCon is now gone, and you're wondering why you have fewer registrations per minute today than yesterday. Well, people can't find the page anymore. That's an event. Services starting and stopping, maybe an Nginx upgrade, things like that. These all provide context.

What you can do with events is overlay them on your graphs to find that context. What you're looking at here, each of these vertical orange lines, is deployments overlaid on a graph of IO on a given cluster. In this case, maybe not particularly exciting, but you can imagine how useful this is as you're doing code deployments or making other changes in your environment: did a deployment correlate with an outage, or with some other dip or spike in your metrics?

So when do we let our engineers sleep? Or, since I imagine there are a lot of engineers in this room, when do we let ourselves sleep? Let's keep in mind that we don't need to conflate alerts with paging people. Alerts can be quite useful purely as records that something occurred at a point in time. If it's low severity and not actually impacting your customers, but maybe it's an event that was problematic or something you want to take a look at later, use it as a record and come back to it during business hours. The next level up, something a bit more critical, might be a warning that you're near or just over some capacity limit; send an email on that, but don't page your engineers in the middle of the night. And finally, for high-severity alerts, you want to wake somebody up. The thing is: only wake somebody up if it's a symptom that's extremely urgent for your end users, and it should be actionable every single time. The minute you start sending pages that aren't actionable, you have engineers questioning whether they should get out of bed to deal with the problem. They might just shove the phone under the pillow, hit snooze, and move on.
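Here's a minimal sketch of that three-tier idea: record versus notify versus page. The alert fields and the routing rules are hypothetical; the point is that only urgent, user-impacting, actionable symptoms should reach a human's phone:

```python
# Minimal sketch of tiered alert handling: record, email, or page.
# The Alert fields and the routing rules are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str         # "low", "warning", or "critical"
    user_impacting: bool  # is the symptom visible to end users?
    actionable: bool      # can the responder actually do something right now?

def route(alert: Alert) -> str:
    if alert.severity == "critical" and alert.user_impacting and alert.actionable:
        return "page"     # wake someone up: urgent symptom, clear action
    if alert.severity == "warning":
        return "email"    # near some limit; handle during business hours
    return "record"       # keep as a record to review later, never interrupt

print(route(Alert("homepage p99 latency > 5s", "critical", True, True)))    # page
print(route(Alert("disk 80% full on db replica", "warning", False, True)))  # email
print(route(Alert("nightly batch job retried once", "low", False, False)))  # record
```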
So again, when do we alert? Let's focus on those work metrics, those symptoms, the things that are actually impacting our customers, and use all the other bits, the resource metrics and the events, for diagnosis. Alert on throughput: is a given value much higher or lower than what we should be outputting right now? We want to know that. Alert on an increase in error rates that might impact the success of what your customers are trying to do. If work is taking much longer to complete, if the website's taking 60 seconds to load when we know people get annoyed at milliseconds, alert on that. But don't alert on that high CPU usage, because it's not going to tell you whether or not you're losing money.

As you get these alerts and work through them, you'll use a sort of recursive workflow. Every portion of your stack has some work metric; it's not just the website your customer loads or the API they call. You probably rely on a database or some other piece underneath you in the stack, especially in more complicated environments, and each of those layers has a work metric of its own. If you're a DBA and your job is to keep the database up, then your work metrics are how many SQL queries you're returning and how long they take to return. That's just as important to you as the API calls are at the top. We each have our work metric, so I encourage you to figure out what the work metric is for what you do in your job and for the services you're responsible for.
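As a rough illustration of that recursive workflow, here's a sketch that walks a hypothetical dependency tree, checking each layer's work metric. The service names, dependency map, and pass/fail statuses are all made up:

```python
# Sketch of the recursive workflow: when a service's work metric looks bad,
# look at its resources and events, then recurse into each dependency's own
# work metrics. Service names, dependencies, and statuses are hypothetical.

DEPS = {
    "website": ["drupal-app"],
    "drupal-app": ["mysql", "memcached"],
    "mysql": [],
    "memcached": [],
}

def work_metric_ok(service: str) -> bool:
    # Stand-in for a real check, e.g. error rate or p99 latency vs a threshold.
    snapshot = {"website": False, "drupal-app": False,
                "mysql": False, "memcached": True}
    return snapshot[service]

def diagnose(service: str, depth: int = 0) -> None:
    ok = work_metric_ok(service)
    print("  " * depth + f"{service}: {'ok' if ok else 'DEGRADED'}")
    if not ok:
        for dep in DEPS[service]:
            diagnose(dep, depth + 1)

diagnose("website")
# website: DEGRADED
#   drupal-app: DEGRADED
#     mysql: DEGRADED      <- likely where to dig in
#     memcached: ok
```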
So where does a lot of our current monitoring fit in? As I mentioned before, a lot of us are really focused on resource metrics, on things that maybe didn't used to change all the time. Think about the tools: I asked earlier about Datadog, but how many folks are using things like Nagios or Cacti right now? Do you know when those were originally built? Nagios came out in March 1999, Cacti in September 2001. They're amazing tools; I have built many a monitoring system on top of them, and I use them all the time. My question, though, is whether they still apply to what we're doing in our environments today. The reality is that things have changed. We're not standing up servers and leaving them there for ten years; we're replacing them in days or minutes, as we were saying before.

The other thing to keep in mind is that we have too many tools. How many times have you gotten an alert, gone to look at your trending system, and found that the graph and the alert show something entirely different, with no way to correlate the two? We have lots of point solutions: if you're using something like Hadoop or Cassandra, you're likely picking up a monitoring tool just for that. So how does this all fit together? Monitoring doesn't mean just collecting all the metrics, shoving them onto a bunch of dashboards you can't recognize, and getting tons of alerts all the time, or you end up like this guy, and I'll move on before I give somebody a seizure, right? We want to collect all of this data from as many sources as we can find and bring it into a single place where we can interact with it and make intelligent decisions.

Another issue is cryptic alerts. One of the things I can't stand is getting an alert for some node I've never heard of and wondering what's going on. I don't know what that means. "db-server-1": is that something from my app tier? Is that something powering our registration system? What is it? So as we're doing our monitoring, let's take a step back and explain to our teams what's happening. An alert should contain all of the information the person responding needs to address the issue. Something like: our home page is taking more than five seconds to load; our customers get frustrated when it takes more than X seconds; you should go fix something; and here are links to dashboards, here are links to runbooks. Maybe put the runbook right in the alert. This is an example of what we do when we monitor our own systems at Datadog, and we encourage our customers to do the same. When you wake up at three in the morning from an alert, trying to swap back into your brain all the information about your infrastructure that had been sitting on disk while you were dreaming about your vacation or some hobby, you don't want to waste five minutes of incident response figuring out what's going on while you get your cup of coffee and try to wake up. Make sure you're giving your engineers all the tools and resources they need the moment they respond: why is this important? What should I do about it? Who do I call next if I get stuck? If I don't respond, where does this escalate? As much of that information as possible should be in the alert, or at least in a runbook the alert links to. Don't put your team in the awkward situation of doing discovery on the fly as incidents are happening. Ideally you're never having the same incident twice as it is, so let them focus on the causes of the symptoms, not on figuring out why they should care about the symptoms.

The next thing I'll mention: averages are lies. Stop with systems built on round-robin databases, where after an hour everything is chunked into five-minute averages, after a day into hour averages, and after a week or a month you're looking at day averages. When you try to figure out how much capacity you needed for the last Super Bowl, or the last DrupalCon, or the day you announced some amazing feature, and you go to buy those servers or scale up in the cloud, and you realize the average is nowhere near what your peak was, you're not going to be happy. Keep the real data. We encourage you to keep it for about a year, so you can look at year-over-year seasonality.
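Here's a tiny, made-up illustration of why those rolled-up averages will burn you during capacity planning:

```python
# A one-minute traffic spike disappears into an hourly average.
minutes = [1000] * 60   # requests per minute over a quiet hour...
minutes[30] = 50000     # ...with one huge spike at minute 30

hourly_average = sum(minutes) / len(minutes)
peak = max(minutes)

print(f"hourly average: {hourly_average:.0f} req/min")  # ~1817 req/min
print(f"actual peak:    {peak} req/min")                # 50000 req/min
# Capacity planned off the rolled-up average misses the real peak by ~27x.
```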
So finally, all of these things are changing, all the time. I don't care how fast your Puppet or Chef runs are, or your Ansible or Salt, or whatever automation you've wrapped around your static Nagios configs or some other monitoring system. Again, I don't mean to beat up on Nagios; I actually use it, and I've used it quite a bit over the years. The point is that if you're doing a five-minute converge time, it's not going to be as fast as these hosts coming and going, and you're going to get alerts every time you spin something new up or shut something old down. So try to find something dynamic that meets your needs there; otherwise, that pager is just going to be going off all the time.

The next thing: try to avoid thinking about things from a host-centric perspective. Your users don't care about the individual host; they care about the service. What we're looking at here, as an analogy, is the Ptolemaic model of the solar system, from when we used to say everything in the solar system revolved around the Earth. How hard was it to come up with the math for where things were in the solar system at any given time? It's the same if you're trying to figure out how your service is performing, or whether it's available, based on individual hosts. If you flip it around, you're in a much better situation: the math is much easier, and the lines are much clearer. Even if this model isn't 100% accurate of how our solar system looks, it's much closer and much easier to think about and interact with.

Next: try to collect as much metadata about your environment as you can. Tags are key to modern monitoring; they let you answer questions about what and where things are in your environment. Everything in your environment is likely producing some amount of metadata, whether it's which application runs on a host, the run lists from Chef or Puppet or some other configuration management system, the metadata that infrastructure providers like Amazon and Google let you attach to their resources, or the labels on your Docker containers and Kubernetes objects. Pull all of that out and bring it together, and it lets you start to slice and dice your metrics: rather than just looking at a given host's CPU, you can look across a slice of a tier or a portion of your environment.

So what do we mean by tags? Well, we talked about data points, and the fact that we collect nearly a trillion of them a day now. When you send a metric to us, it looks something like this, and this is true of any time series monitoring system. You'll have the name of the metric, which is what you're measuring; in this case we're looking at bytes received on the network. A value: how much. The timestamp: hopefully you're keeping this in seconds or milliseconds. And then the where; here, a file server. But ideally you're also going to attach a number of tags to this thing, so you can slice and dice it in different ways, across any dimension you want.
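Here's roughly what such a tagged data point might look like as a structure. The field names are illustrative; every time series system defines its own wire format:

```python
# Illustrative shape of a single tagged time series data point. The field
# names are for illustration; real systems each define their own format.
import time

point = {
    "metric": "system.net.bytes_rcvd",  # what we're measuring
    "value": 2342.0,                    # how much
    "timestamp": int(time.time()),      # when, in epoch seconds
    "host": "fileserver-1",             # where it came from (hypothetical name)
    "tags": [                           # metadata for slicing and dicing later
        "env:prod",
        "role:fileserver",
        "instance-type:t2.small",
        "availability-zone:us-east-1a",
    ],
}

# With tags attached, you can later ask for "all t2.small hosts in us-east-1"
# without ever naming an individual host.
```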
So here's an example of what tags look like across an environment with a number of pieces: databases, cache servers, app servers. We have different instance sizes, and the size of each instance could be a tag. What's running on those instances could be tags, and so could where they are: here we have things split across the various availability zones of Amazon's US East region, and the rows across the top are different regions, like Europe or the US West Coast. So when I say, show me the performance of all things that have the tag t2.small, I can slice across my environment and pull that together in a single view. That lets me ask very rich questions about my infrastructure, things like: monitor all the servers running the Drupal application in us-west-2, across all availability zones, on c3.xlarge instances, that are using more than a particular amount of some resource. You want to be able to ask questions that are important to your business, things like: alert me when 90% of web requests to the Drupal application are taking more than 0.5 seconds to process and respond in a given region. These are things you actually care about, not the CPU and memory resources we were talking about earlier.

And finally, try to emit some custom metrics from your own applications. A lot of times there are metrics on both sides of an interaction: your infrastructure can tell you how long a database query took to execute on the database server, but how long did your application see it as having taken? Instrument your code with these custom metrics. There are a number of libraries for emitting them over something like StatsD, which is great because it's asynchronous: your applications won't block, you just emit the metrics and send them off to something like Datadog or Graphite or one of the many other monitoring systems, and the responses you're sending your customers aren't waiting for those metrics to be generated and accepted on the other end. There are great libraries for this in languages like PHP. I've even seen StatsD libraries in Haskell. I don't know who's building web applications in Haskell, but that's an option for you if you choose to.
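For example, here's a minimal sketch using the statsd Python client as a stand-in (the PHP clients look much the same). The metric names and the local agent address are assumptions:

```python
# Sketch: emitting custom application metrics over StatsD. It's UDP and
# fire-and-forget, so the request path never blocks on the monitoring backend.
# Uses the "statsd" Python client as a stand-in; PHP clients look similar.
import time
import statsd

stats = statsd.StatsClient("localhost", 8125)  # local StatsD/agent, assumed

def process(cart):
    """Stand-in for your actual application logic."""
    time.sleep(0.05)

def handle_checkout(cart):
    start = time.time()
    try:
        process(cart)
        stats.incr("shop.checkout.success")   # work metric: successful units
    except Exception:
        stats.incr("shop.checkout.error")     # work metric: errors
        raise
    finally:
        # Performance as *your application* experienced it, end to end.
        stats.timing("shop.checkout.ms", (time.time() - start) * 1000)

handle_checkout(cart={"items": 3})
```

Because StatsD rides on UDP, a slow or absent monitoring backend can never hold up the customer's response; the worst case is a dropped metric.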
So I rushed through this a bit because we're toward the end here, but if you follow these practices, you'll stay out of the doghouse, even when you're in containers, and keep your customers happy. We've got a lot of great resources online: our Metrics 101 and Monitoring 101 series on how to collect this data and how to use it to investigate performance issues, as well as guides on individual technologies, everything from PHP to Java to Kafka and what have you. There are also some great resources on the Monitoring Sucks project. It's a little old, but it's a fairly good inventory of the open source monitoring tools out there, as well as some of the commercial ones, so you can see where these practices might fit in, whether with Datadog or with some other tooling of your choice. I'm happy to do any Q&A you might have, and that's the link to rate this session afterwards, if you're so inclined.

Audience: Hi. Could you go over the licensing of Datadog?

Sure. Datadog is a commercial offering where we charge per agent running on a given host. So if you have 100 hosts and you've deployed our agent on all 100 of them, we'll charge you at that amount; it's $15 per host per month. If you're using containers, we don't consider each container a host, but rather each physical or virtual machine you've spun up, and you can run a number of containers on a given host. So the agent runs locally, and then we also have the SaaS integrations, where we pull things in from Amazon or Google or New Relic or other monitoring and infrastructure systems. In most cases those are not counted against your host count. I say most; don't quote me on that later, it depends on the situation.

Audience: Given that we live in a world of virtualization, containers running in virtual servers, what's a metric you find powerful with regards to networking? That seems to be one of the bottlenecks with virtualization, especially with Docker.

When you say networking, it's important to know what aspect of networking you mean. Are you talking about your switches? Round-trip latency to your customers?

Audience: I'd say round-trip latency to customers.

Then I think the most important metric to me is how long it's taking to serve an individual request a customer has made, whether that's a web page loading or an API call responding.

Audience: Say your customers were getting degraded response time, but it was coming from an application running in a Docker container, running in an EC2 instance, and there's some bottleneck there, whether it's your containers auto-scaling within an EC2 instance or multiple containers getting crowded onto one instance.

Right, so let me go back a few slides: we're going to treat this in layers. You start at the top. The first work metric you get paged on, the symptom you notice, is that the website was slow to load. Then you go look at the resource metrics and the events around that to figure out where it might be blocked. As a top-level web application, my resource might be a database behind me that I rely on, which has hit some capacity limit and is responding more slowly. I'll look at that, then look at what its work metrics look like, and trace that down the stack. There's not one metric I can give you where, the moment a web request is slow, that single metric tells you all the way back why it was slow. You need to understand something about your own applications and how they interact, and work down that funnel.

Well, I'm happy to answer more questions afterwards if folks are shy. I've got stickers and other stuff up here if you want some swag.
And otherwise, I hope to chat with you about how you're monitoring your Drupal and other environments at one of the parties tonight or around the conference. I'll be here all week. So thanks, guys.