I think we're just about at time, so I'm going to go ahead and get started, and hopefully there aren't too many late stragglers. They'll miss out on the important things, like... well, not really important things, like a bit about me. I'm Jason. Welcome, this is 100% Observability. I'm a technical writer and evangelist at Datadog; "docs and talks" is how I shorten that, so I write documentation and I travel and speak. I'm also a DevOps Days organizer: I help with the global team, and I help organize DevOps Days Portland. If you're into DevOps, you should definitely check out your local DevOps Days event. I've also been a DrupalCon track chair for a while, so I helped select sessions for this conference; hopefully you've enjoyed them. I'm also a travel hacker, so I get on planes and fly around to nowhere in particular, just for the miles, and occasionally, when I do get off and visit places, I like to find interesting whiskey. On Twitter I'm @gitbisect, and I also have an email, jason.e at Datadog HQ. If at any time, even after this talk, you think of a question, feel free to email me or tweet at me. Also, if you were in the keynote this morning and stayed through Q&A, I might also be known as Preston So or David Huang, so you can tweet at them too: it's @PrestonSo and @eatings, E-A-T-I-N-G-S. Feel free to tweet at them about this session and how great you thought they were. That'd be amazing.

Have people heard of Datadog? Quick show of hands. Just a few people? Cool. For those who haven't, Datadog is a SaaS-based monitoring platform. To give you an idea of the scale we handle: we ingest 15 million data points per second, which comes to over a trillion data points per day, so we do a lot of monitoring for a lot of people. We have open source clients and libraries and a bunch of other cool stuff. I do have to do the obligatory "we're hiring" thing: we don't do any Drupal, but if you're in ops or you're looking for really cool challenges handling tons of data, we're hiring all over the place for all sorts of positions: SREs, ops people, front-end devs, things like that. So definitely check out that link if you're interested in a new job. And it's @DatadogHQ on Twitter, not @datadog. @datadog is a black Labrador retriever who will make fun of you if you tweet at him. So again, don't tweet at @datadog; tweet at @PrestonSo or @eatings and tell them how great this session is.

At Datadog, one thing about monitoring is that you want a lot of integrations, and we have a whole ton of them, simply because if you're involved in any sort of tech, even Drupal-related tech, we all face this: there's just so much technology out there. You can go down to the expo hall and meet a bunch of these vendors. There's been an explosion of tooling; we live in this fantastic period where everyone is creating really cool stuff, and that applies to monitoring as well. When you think about everything out there for monitoring our systems, websites, and applications, there's a ton of it. If you talk to any of the vendors, they'll tell you they monitor it all, that you just need their thing, and that you should pay them money. Or, if you're involved in open source, you have all these tools and you're left wondering what you need to put together. That's essentially why I wrote this talk: there's just too much.
So let's try to make some sense of the monitoring systems out there, whether they're services or open source, so that we can actually get full coverage. But before we do, we really ought to talk about how we should think about metrics: what do we need to gather, and how do we need to think about the data we're gathering? There are really four good qualities of metrics that you should keep in mind.

The first is that they have to be well understood. Obviously, if I understand a metric and you don't, or someone else in our organization doesn't, then it's completely useless. A great example of this is the Mars Orbiter. I love this example. Is anyone familiar with the Mars Orbiter? Yeah, I saw one hand. For those who don't know, the Mars Orbiter is no longer orbiting Mars. It was created in collaboration between NASA and Lockheed Martin. NASA traditionally uses metric units, things like kilograms and kilometers, and Lockheed Martin, being an American company, traditionally uses imperial units: miles, pounds, feet, inches, numbers like that. Nobody bothered marking which units were used in one of the orbital calculations; everyone just assumed they were all on the same page. Obviously their data was not well understood, because the Mars Orbiter crashed into Mars. So it's very key to understand what metrics and what units you're using and get everybody on the same page. As we dive into some of the monitoring solutions out there, this will come up as one of the ways we evaluate things.

The second is that your metrics need to be sufficiently granular. I love this one. It's from the Olympics, I guess that was last year, down in Rio: some of the numbers from the medal race for the men's 50 meter freestyle. I enjoy it because it's super fast. And as we can see, hopefully those numbers aren't too cut off on the screen, everybody except the last guy finished in 21 seconds. So if whole seconds are our granularity, then this is the Special Olympics: everybody wins. We understand intrinsically that granularity matters, so for the Olympics we time to the hundredth of a second. Even then there's one tie here, for sixth place. Similarly, with the metrics we gather, as we talk about the different things we need to monitor, granularity will matter significantly depending on what we're talking about. So keep in mind that things need to be sufficiently granular for what we want to get out of them.

The third property we need from all our metrics is that they have to be tagged and filterable. The important thing here is that you are all generating a ton of data, even if not at the level of trillions of data points a day; you're generating more data than you ever have before, and you need to be able to make sense of that metric data. One way is tagging and filtering, and we'll see later how this becomes important. Generally, the idea is that if you don't have metadata around the metrics you're gathering, they're going to be much less useful.

And finally, your metrics need to be long lived. As you think about your metrics collection, you need to think about how long you need to store things. Storing things serves different purposes. A lot of times you're storing things to find trends.
Oftentimes you're storing things so you can have a historical reference. Sometimes you're storing things because computers need that history too, so they can start to make some predictions. We'll dive into that a little later as well.

So with those four key properties of being well understood, sufficiently granular, tagged and filterable, and long lived, let's try to make sense of some of this craziness. The way I like to do that is to think about our application stack: the things we run, the technology that makes up an application. If we're drilling down, it starts with the client, our end users. How do they interact with our Drupal sites or our applications? Earlier this morning in the keynote we saw Dries talk about voice interfaces: being able to have a chat bot in front of Drupal, or an Alexa skill that talks to Drupal. So we need to think about how we monitor what we'd consider the client side. Then, we're all here at DrupalCon, so we're very familiar with applications like Drupal, things that traditionally run as backend code on servers. That's pretty straightforward when we think of applications. And then finally, the infrastructure. Infrastructure is really great; I love the age we're in now, because infrastructure is getting totally crazy. You obviously have hosted cloud providers like Amazon, Google, and Microsoft Azure, but then you've got cool things like Docker and Kubernetes and Mesos that run containers. So we have to start thinking about our infrastructure becoming ephemeral: we can spin it up, tear it down, and scale it.

But one of the problems with thinking of things as a traditional stack, in these buckets, is that we tend to miss where the edges really are, or we miss things because we're the ones defining those edges. We really ought to think of it less as an application stack and more as a spectrum. When your end users interact with your application, they don't see a stack. They interact with you as a company, and that goes from whatever they're clicking on in the front end all the way down to the infrastructure, and they don't care which portion it is or which team is responsible for it. They care that it just works, and that it works well. Thinking of things as an application spectrum lets us see where things blur, and it really helps when we're choosing monitoring solutions, because monitoring has a lot of overlap. When a vendor or an open source project says they monitor something, if you think of things as a spectrum, it becomes much clearer where that monitoring ends and where you need to start filling in the gaps.

So there are five areas of monitoring we're going to cover, three of which map onto the traditional spectrum. Starting at the front end, the first is what's traditionally known as performance monitoring. It has two main parts: one is called synthetics, and the other is called RUM, which stands for real user monitoring. As we transition across the spectrum into the application side, we traditionally call this application monitoring, a suitable name since it's monitoring applications, and there are two traditional parts here as well.
One is traditionally called APM, or application performance monitoring, and the other is just known as application monitoring. And then finally, as we transition all the way back into our infrastructure, again a very suitable name: it's called infrastructure monitoring. As we think about these three groups within monitoring, we need to understand that there are overlaps. If we're talking about monitoring an application, you now have applications running on the front end, where you push code and a lot of it executes as JavaScript in the browser. Similarly, with infrastructure there's a blurry line: if you're running in AWS and you've moved off of MySQL and onto Amazon's RDS, is that infrastructure or is that an application? It could be both. So understand that there are gray, fuzzy areas on our spectrum, and know that we need to be aware of them when we monitor.

With all of that, let's dive into the infrastructure side. Why is infrastructure monitoring important? Obviously you need infrastructure, because your application needs somewhere to run. Even if you've bought into the whole serverless thing, serverless is just someone else's server; it still runs on a server. But more importantly, we need to monitor infrastructure because downtime costs us money. We all inherently know that when our applications go down it costs our company money, whether that's actual money in sales, money in donations, or money in PR and branding. For a little idea of how much money: if you're huge, like Amazon, well, Amazon went down a little over a year ago, and being down for 20 minutes cost them a whole three and three quarter million dollars. That's one end of the spectrum; it's a ton of money. But even for Fortune 1000 companies, what we'd consider your average enterprise, IDC did a survey and found that the average cost of an hour of infrastructure failure was a hundred thousand dollars. As mentioned before, I'm a travel hacker, so I'm really interested in aviation. This was an interesting one: earlier in January, Delta had an outage on a Sunday night, had to cancel about 170 flights, and it cost them eight and a half million dollars. More interestingly, and this is a little washed out but hopefully you can see it, beyond the eight and a half million dollars they lost, their stock dropped two points Monday morning. That represents more than just the lost revenue; it was tens of millions of dollars on their part. So there is a cost, not just financial, when we have an outage: it's not just the lost sales, we have to think about the actual cost to our companies and their reputations.

The benefits, though. Why do we monitor?
Well, as G.I. Joe says, knowing is half the battle. Decreased mean time to detection is really what we're after here: getting faster at detecting problems, and faster at putting things back up when they fail. And beyond detection, it's about getting the information we need to speed our recovery, because if you can detect things, and potentially detect them before they happen, you can avoid them.

So let's take those four qualities of metrics and consider them as we look at infrastructure monitoring. The first I mentioned was that metrics have to be well understood, so there's this ease of sharing information. For infrastructure monitoring, well understood means: does it integrate with the tools we already use? If we're implementing a monitoring tool, does it integrate with our ChatOps tool, whether that's Slack, HipChat, or even IRC if we're still using that? Can we send automated messages so that people actually see what's going on? There's also ease of deployment and use: can people log in, can they build the dashboards they need? If you can't build the dashboards you need, or you can't get the alerts, then, just like mixing metric and imperial units, people don't have that information and they're disconnected. Beyond the integrations with the applications you use, also consider integrations with your infrastructure. As you move to things like hybrid or hosted clouds, how do those work alongside the traditional servers you've been running? Can you monitor all of them? Can you monitor virtual servers or containers? And can you bring it all into one place? It's so easy to set things up, and I know people love standing up open source tools, so someone sets up a Nagios server and they're monitoring something, but how does that play with all the other monitoring you've got? Try to bring everything into one place.

So that was being well understood. Being sufficiently granular is really crucial here, especially as you move to hosted services, because when you're hosted it's often really difficult to adjust granularity yourself. If you're looking at the popular public clouds: CloudWatch from Amazon gives you one-minute granularity, and it's similar with Google Stackdriver. Microsoft Azure gives you one minute, but only for up to 24 hours. So think about the granularity you need. Imagine having a one-minute outage on the weekend and coming back to check on Monday morning: with Azure that's essentially gone, because it's been rolled up into one-hour buckets, so the outage now shows at one sixtieth of its size, which essentially means you'll never actually see it.

That comes into play when we think about the dashboarding tools we use. This is just a simple graph showing a CPU spike, and the top is at one-second granularity. It's really hard to see the peak there, but if you zoom in, the spike tops out at about 46.7%, and we can see exactly where that spike is.
When we roll that up to one minute, we can see that the spike has shifted to the beginning of the minute, but more than that, and again it's hard to see, but look at the scale off to the left: because it's been rolled up into a one-minute bucket, it's now peaking at only about 36%. So our peak is much smaller. And when we roll up to five minutes, not only has the spike shifted all the way to the start of our five-minute bucket, it's only peaking at about 11%. So when we roll things up, it's not just about losing information: because the time is shifting and the height of the spike is being aggregated away, we're not just losing information, we're starting to receive false data. Having that granularity isn't just about seeing what you need, it's about seeing the accurate information that you need.

Part three of our good qualities is that we need metrics that are tagged and filterable. The main advantages on the infrastructure side of being cloud based, with containers or virtual machines, are the ease of scaling and distributing. When we move to Amazon, it's really easy to be geo-distributed across different availability zones and different regions of the world. The flip side is that this generates a lot of useful metadata, and if we're not capturing it, if we're not tagging our metrics with where they came from, what region of the world, what size of machine they were on, then we're losing data that would help us understand our metrics later and understand where our problems are, whether they're happening in a specific part of the world, or on a specific type or size of server. So when we think about our metrics on the infrastructure side, we need not only the what, the how much, and obviously the timestamp of when it happened, but also tags for where it happened and what type of system it happened on. That allows us to think of our metrics less as something we just put in a dashboard and hopefully can see, and more as something we can query, almost like a database. Rather than just looking at a chart, we can query against it and say: show me all of the information from my Apache servers running on, if we're on Amazon, large instances in us-east-1a that serve my Drupal application, versus the Apache that's running, maybe, WordPress. As we start to do this, we can ask these questions and hone things down, and it allows us to see patterns and trends.
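To make that concrete, here's a minimal sketch of what a tagged metric can look like on the wire. It assumes a StatsD-style agent listening locally on UDP port 8125 and follows the DogStatsD convention of appending tags after a "#"; the metric name and the tag keys and values (region, instance type, app) are hypothetical examples rather than anything a particular tool requires.

```php
<?php
// Emit one tagged counter metric in the StatsD/DogStatsD text format over UDP.
// The agent address, metric name, and all tag keys/values below are
// hypothetical examples.

function send_tagged_metric(string $name, $value, string $type, array $tags): void {
    $tagPairs = [];
    foreach ($tags as $key => $val) {
        $tagPairs[] = $key . ':' . $val;
    }
    // Wire format: "metric.name:value|type|#key1:val1,key2:val2"
    $packet = sprintf('%s:%s|%s|#%s', $name, $value, $type, implode(',', $tagPairs));

    // StatsD-style agents conventionally listen on local UDP port 8125.
    $socket = @fsockopen('udp://127.0.0.1', 8125, $errno, $errstr);
    if ($socket !== false) {
        fwrite($socket, $packet);
        fclose($socket);
    }
}

// One web request, tagged with where and on what it happened, so it can later
// be filtered like a query: "requests from us-east-1a, on m4.large instances,
// serving the Drupal app".
send_tagged_metric('web.requests', 1, 'c', [
    'region'        => 'us-east-1a',
    'instance_type' => 'm4.large',
    'app'           => 'drupal',
]);
```

With tags attached at emission time like this, the "Apache on large instances in us-east-1a serving Drupal" question above becomes a simple filter at query time, rather than something you have to bake into the metric name.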
Speaking of patterns and trends: keeping your information around for a long time, quality number four, long lived. Thinking again about hosting on public clouds: if we're using Amazon CloudWatch for monitoring and retaining metrics, they're the industry leader at 15 months. But similar to the way Azure rolls up its metrics, Amazon does that for retention: they won't retain all of your full-resolution data past 15 days; they start rolling it up into five-minute increments and then into one-hour increments. That's still better, because they keep things around for 15 months, compared to Google Stackdriver, where you only get six weeks, or Azure, which is 90 days, so about three months. So start to think about how long you should be keeping your data in order to see trends. Fifteen months is fantastic if you can actually keep things, and for those who don't understand why 15 months, since it seems like an odd number: it's a year and a quarter, because oftentimes we see patterns that run on quarters. So you keep a year to see year-over-year trends, and a bit more so you can also see quarter-over-quarter trends.

I've talked a lot about how we should consider monitoring our infrastructure, and a lot of people wonder, okay, what should I actually use? There are a ton of projects out there, particularly in the open source world. You've got things like Nagios, or Nagios, you can pronounce it either way I guess, or Icinga, which is the newer fork of it. Sensu is getting really popular; Sensu essentially takes Nagios's idea and modernizes it for modern infrastructure, and if you're interested in that, Howard Tyson has a session right after this at 5 p.m., a couple of doors down. Prometheus is another super popular one. But you have to understand that with all of these open source tools you're putting together parts, in a very similar way to how you work with Drupal modules: you have to connect a bunch of pieces, configure them, and put them together so they work. If you're running Prometheus, it's largely gathering data for you, and you have to understand the nuances of the open source projects you're using. For example, Prometheus is fantastic, but it stores one file per time series, so every single metric you gather has a file, and if you can imagine storing everything in one file, well, over time that grows. So Prometheus doesn't really recommend that you retain your data for a long time; they recommend you put it into something like IronDB or another tool that's actually designed for longer retention. So, similar to the way we evaluate Drupal modules, you have to evaluate your open source projects: understand what they're good at, what they were intended for, and marry them with things that fill in the gaps. Obviously I work for Datadog, so if you're thinking more on the SaaS or PaaS side, something that's hosted for you, Datadog's fantastic, but I'm not going to turn this into an infomercial. CloudWatch, if you're running on Amazon, does a great job; it does have some missing spots, but for the most part it's a good, cheap service.

So we've covered infrastructure. Let's talk a little bit about application monitoring, the things that run on top of our infrastructure. Why is it important?
Obviously the applications we run are important to our businesses: they're how our businesses make money and how they get known. We already know that downtime costs money, and I threw some stats at you for that, but as we start thinking about applications, we also know that slow performance costs us money, so we need to start considering that. As I mentioned, there are two types under application monitoring. One is just straight-up application monitoring, and the benefit there is that we've all built custom apps; that's the whole point of Drupal, to build custom web applications, and a lot of us build other applications too. So we need to gather custom metrics from those applications and understand how well we're doing on a business level. Application monitoring also covers the applications that we didn't write: if we're running Drupal, how is MySQL doing? Are we gathering metrics from MySQL? If we're running NGINX, or Varnish out front as a load balancer or reverse proxy, how is that performing?

So again, diving into those four key qualities. Well understood: for application monitoring we're largely dealing with some sort of SDK or library in the code we're writing, so how does the code we write integrate with it? Or, if we're using other applications like MySQL or NGINX or Varnish, how do those integrate with it? Well understood in this case means: how well documented is it? Is there an API? Is there an SDK that lets me get up and running really quickly? And how many integrations does it have? Something that's well understood should have a good community and a good base, so if I want to monitor some other application, hopefully there's already an integration and I don't have to write it myself.

Then the other three qualities. Is it sufficiently granular? For our applications we probably still want one-second granularity, or maybe we want to start diving into sub-second granularity. Second-level granularity is good for questions like: if we're running MySQL, how many queries per second are we handling? But maybe we want sub-second data too: how long does a query last, what's the latency on a query? Queries are usually measured in milliseconds. Tagged and filterable is again super important here: if MySQL has been broken out into master/slave or primary/secondary and we're reading from a bunch of different MySQL servers, we want to know which one a read came from. Or if NGINX is load balancing and diverting requests to a particular web server, which one is it going to? What size is it? What's the metadata on that server, so we can start to understand what's going on? And then long lived, again, so we can start to see trends in how we're performing and what our systems are doing over time.
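As a concrete illustration of that kind of millisecond-level, tagged application metric, here's a minimal sketch in PHP. It assumes the same kind of StatsD-style agent on local UDP port 8125 (StatsD itself comes up in a moment), and the database host, credentials, table, metric name, and tags are all hypothetical.

```php
<?php
// Time one MySQL query in milliseconds and report it as a StatsD-style timing
// metric, tagged with the replica it ran against. Host "replica-02", the
// credentials, and the tag values are hypothetical examples.

function send_statsd(string $packet): void {
    $socket = @fsockopen('udp://127.0.0.1', 8125, $errno, $errstr);
    if ($socket !== false) {
        fwrite($socket, $packet);
        fclose($socket);
    }
}

$pdo = new PDO('mysql:host=replica-02;dbname=drupal', 'reader', 'secret');

$start = microtime(true);                        // wall-clock start, in seconds
$nodeCount = $pdo->query('SELECT COUNT(*) FROM node')->fetchColumn();
$elapsedMs = (microtime(true) - $start) * 1000;  // query latency in milliseconds

// e.g. "mysql.query.time:12.4|ms|#db_host:replica-02,app:drupal"
// (the "#tags" suffix is the DogStatsD extension; plain StatsD has no tags).
send_statsd(sprintf(
    'mysql.query.time:%.1f|ms|#db_host:%s,app:%s',
    $elapsedMs, 'replica-02', 'drupal'
));
```

Timing in the application like this answers the latency question, while the db_host tag answers the "which MySQL server did that read come from" question from above.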
So, examples here. Again, there are a lot, and application and infrastructure monitoring typically overlap, so all of the open source things I mentioned before, things like Nagios and Prometheus, can monitor some of your applications. But again, you're going to have to piece a lot of this together: you often have to install some sort of collector, marry that with a time series database to actually ingest your data, marry that with a dashboarding tool, and then with an alerting tool to actually let you know when something's going wrong. There are standards for how metrics are generated; the two main formats are collectd and StatsD. At Datadog we use StatsD, or a modified version of it. And there is a Drupal project out there: if you're thinking of sending metrics from Drupal directly, as application metrics, to some sort of application monitoring tool, you can use the Drupal StatsD project. Popular services here are really going to be the same: at Datadog we do it, of course, CloudWatch does it, and most application monitoring companies will be the same as your infrastructure monitoring companies. But again, consider how many integrations they have, because you don't want to be writing all of those yourself.

The other part of application monitoring I mentioned is APM, or application performance monitoring. Here we need to talk about the cost of slow performance, and the graphic there is how most of us feel when we hit a site and it's just super slow. The statistics bear it out: the permanent abandonment rate for a slow site is 28%. More than a quarter of the people who hit a site that's slow will never come back, compared to only 9% for an outage. Which essentially means having your website go down is far better than having a slow website. Walmart.com did a study: for every 100 milliseconds they improved their page load time, revenue grew 1%. So making things faster makes you more money. Google and Bing essentially did the reverse study: they intentionally slowed down their search sites, and they found that they lost about 3% of revenue for every one second of delay. So that's why APM is important: we want to monitor performance and how it impacts our revenue. But one thing a lot of people don't consider when they're thinking about APM is that there's also an impact on your costs, particularly if you're hosting on some sort of public cloud. When you buy time from AWS, you're buying CPU time, which means that if you're running code that takes a long time and isn't optimized, not only are you losing money from your site being slow, you're also spending a lot of money to run it. The other interesting thing is that optimized code correlates with fewer defects and bugs. We all feel that as common sense: if we write good code, there won't be a whole lot of bugs in it and it'll be faster. There's an interesting session on Thursday; I haven't met this guy, but it looked really cool, so I'm going to try to check it out, and hopefully you will as well. Joseph Purcell, on Thursday morning, has one on code quality, and he's going to talk about how to optimize your code.

So, diving back into our four key qualities of metrics: well understood. For APMs, well understood means, does it understand the languages we write in?
For Drupal that's PHP, but if we're writing in other things, does it support those? More and more we're all becoming polyglots, writing applications and services in the languages that best suit them. So does it support the languages I have now, and can it support other languages I might shift into? And then, not so much for PHP, but does it handle asynchronous languages? If we're starting to write things in Go, or in Python 3 with asyncio, can it handle asynchronous transactions and runs where things spin off concurrent processes? Also, does it integrate with other applications? This is a little different from traditional application monitoring, where you're receiving metrics from something like MySQL or Varnish. On the APM side, we want to know: can it actually see the queries we make against a database and trace them, so we can see the latency and how slow queries affect our code? The same goes for things like caches: whether something is in cache or not, we want to understand how that makes our code slower or more performant. And finally, how easy is it to integrate? Nobody wants to go modify all their code and insert lines everywhere to trace it and figure out its performance. So does it work with the frameworks we're using?

Sufficiently granular, in this context: almost every APM uses sub-second timers. If you find one that doesn't, you should run away quickly; that would be ludicrous, because code operates that fast. But when it comes to being sufficiently granular, we really want to understand how our APM samples things. With APMs, you should be running them on your production systems, not just in dev, test, or on your local machine. You want real users hitting your systems so you can figure out how your actual sites perform. But doing that means you take a slight performance hit, because you're now running a little additional code on top of everything. So, number one, it shouldn't negatively impact your customers, because that's the whole problem we're trying to solve. To manage that, it needs to sample: we want it to run on only a certain fraction of user sessions as they interact with our applications or sites. And for that, sufficient granularity means the sampling has to be statistically sound. You can't just gather metrics when something fails or when something is slow; you want to collect everything you can, so you understand your latency distribution and not just your extremes or your averages.

Tagged and filterable? Again: does it handle distributed environments? As we move to more and more distributed systems, when an APM is tracing our code and a request jumps from one server to another, can it actually follow that along and see where the latency is? And similarly, if we're running modern environments like Docker or Kubernetes, where part of a request might run in one container that then dies and the work is picked up by another container or system, can our tracing actually follow that? And then finally, long lived.
Long lived, for APM: traditionally we don't keep things around too long, because you mainly need to see the correlation between the code you deploy and the performance of your application or system. Although, if you can keep things around longer, it's interesting to see your application performance over time: performance changes from any single code change are usually not huge or significant, but over time they can build up. So if you do keep things around, that's an interesting piece of data you can get.

So again, let's do some examples. There aren't too many APM examples in the open source world, particularly for PHP. There is an interesting GitHub project I found, PHP APM, which is an open source PHP-based APM. There's also a project called OpenTracing; the website is opentracing.io. The notion there is, it's part of, I believe, the Linux Foundation, and they're trying to set up a standard for how we do APM, how we trace code, so that you essentially don't have to build all the clients yourself and you can interchange the monitoring tools you use. Zipkin is an interesting one; there's no PHP support yet, but they're one of the leaders. LightStep is another; they list PHP, but it's in early access right now, so you have to request access. Other things that are interesting, not really monitoring but they can be: XHProf. I think a lot of us are familiar with XHProf as a profiler, but there is an XHProf sampling module, so you could potentially set up your own APM with XHProf and have it sample real-world hits on your sites. Popular services here: New Relic is the big one. I think we've all heard of New Relic; this is their bread and butter. At Datadog we did release an APM. We don't have official PHP support, but the community decided to do something about that: some guy spent a weekend, wrote an integration, and published it on Twitter the other day. I haven't played with it, so I don't know how well it works, but if you do use Datadog, definitely give it a try.

I have mentioned profiling a few times, and this is often one of the things people get confused about with APM: what's APM versus profiling, or sometimes tracing versus profiling? Profiling isn't monitoring; it largely isn't run on production applications. The way I like to think of profiling is that it's similar to automated testing. When we write code, we commit it, run it through automated testing like unit tests or behavior tests, and confirm that it's okay, and that's where profiling really shines. You never take a performance hit from real-world users, because they never see it, but it has this great ability, when you check in code, to hook into a profiling system that runs automatically and gives you a baseline for how your code should behave, and oftentimes it can help you find bugs and things like that. On the open source side, XHProf again is the really well known one. There's also Blackfire.io, who I think just walked in at the back there. They have a booth downstairs, so you should check them out. They also have a really great book that explains a lot of how profiling works and what you should consider when you're weighing profiling versus APM.

So that covered two portions of the spectrum. Let's move on up into performance monitoring, the client side. And there are really two types here: real users and synthetics.
A lot of people wonder why you should use both, and they actually work really, really well together, which is why I've got Fry and Bender up there. With real user monitoring, you really want those real-world user experiences. You want to measure how people are actually using your website and generate metrics from that, because when we build applications we have assumptions about how people will interact with them, and more often than not those assumptions are wrong. People use applications in really strange ways that we always call edge cases, but they usually aren't. So getting those real-world experiences is critical for testing the performance of your application. But more than that, with real users we get diversity in our testing. When we test, we all have our development environments; most of us are running really great souped-up laptops with as much memory as we can get, so performance is never an issue. Compare that with the real world, where someone in a developing country is on a really poor connection using not quite a smartphone, a semi-smart phone. How does that work? So we can learn a lot of interesting things from real users around the world: different types of connections, different types of browsers. And this feeds back into creating synthetic tests. Synthetic tests are the ones that aren't real users: essentially robots, scripts we write that run preset actions for us. The great thing about synthetics is that they're independent of your user activity. They still hit your production sites, but you don't have to wait for users to actually show up. And the great thing about robots is that they can often test things that real users can't, or help test things you don't have enough real users testing. If you're thinking about accessibility, for example: depending on what you run, how many blind users do you have? Maybe you want to bulk that up and test, from the synthetic side, the things you don't have enough real users exercising.

So let's think about our four good qualities of metrics. Well understood: with synthetics, that means do we understand what's being measured, and can we easily update the tests? On the RUM side, it means can we easily see user sessions and interactions and make sense of what users are actually doing? Sufficiently granular: for both of these you'll have metrics in seconds or, often for the front end, milliseconds, but also, how frequently are the synthetic tests running, and do we have synthetics on all parts of the system? On the RUM side, is the overhead low? Similar to what we were thinking about for APM, we're impacting real users, and we don't want to degrade what they see enough that we start to drive them away. Tagged and filterable: synthetics are robots, so we want to use the real user information we've collected to improve what we're doing on the synthetic side. Can we make synthetic tests that are geographically distributed? Can we make synthetic tests that use different connections and different browsers? Or are we just running Selenium from one server on an old version of Firefox? Those are two very different things. And on the RUM side, can we extract that data from real users?
Can we know, with our real user monitoring, where users are coming from, and can we gather information about their browser and their connection? And then finally, long lived: for both of these we really want to see trends, the trends in how our users are interacting with our applications and our sites. For real user monitoring, a lot of the services out there have started to correlate against business metrics, which is really, really handy: being able to tie how fast something is into some sort of e-commerce system and see how many more people buy things because things are faster.

On the open source side there's not a whole lot here, which is really sad. There isn't really any true synthetic testing out there, which is why I didn't put any up here. There's a lot of load testing; when we think of things like that, we're thinking of, for example, ApacheBench, which will just hit your site and tell you whether it's up or down, but very few things will actually run through and do a whole host of tests. On the RUM side, Boomerang is really the main one out there. Boomerang was started at Yahoo; the team that started it ended up going to SOASTA, so they run it now. Again, I'm not a Datadog infomercial, but if you do run Datadog and you want to integrate Boomerang, this guy actually did it, and it's pretty cool, and also a little crazy, but it's worth checking out. Yeah. Yeah, that's great; let's do questions at the end, just to ensure that we have enough time. Not quite monitoring, but related: things like sitespeed.io and ShowSlow are really useful for starting to gauge your front-end performance. They're not actually monitoring, because they're not going to be running all the time or sending you alerts. Popular services in this category are SpeedCurve, Catchpoint, and Pingdom; those are all really popular.

I've got two more sections that aren't within those three, because I did mention there are really five sections of monitoring. One is other and specialized. Here it's really about looking at your spectrum and starting to think about where the gaps are. For some people this might be the network: if you're running all of your own systems and you're not hosting somewhere, then you might be running your own network, so you should consider how you monitor it. Other things would be security monitoring; there are security monitoring solutions out there that will essentially monitor your systems to let you know if they think you've been hacked. Configuration monitoring, physical monitoring if you're running your own data center, and specialized tools: there are monitors out there like Runscope that are specially designed to hit APIs and test your API endpoints. Open source projects in this area: OpenNMS is a network monitoring project, and OSSEC is a security monitoring project; those are both really interesting. Popular services here: AppNeta is one of the leaders in network monitoring. Tripwire, if you're familiar with it, does security monitoring, particularly for servers. And New Relic Infrastructure: New Relic does APM, but they also have an infrastructure product that's largely configuration monitoring, so it can let you know what sort of software all of your servers are running.

And then finally, logging and other tools. This is one that comes up a lot for me. People ask, how does logging play with monitoring?
Logging is cool because logging goes all across the spectrum. When we think of the things we're writing, whether apps or front-end code, we largely have those emitting logs, to the console or sometimes transmitted back to us. Similarly, we're all familiar with Drupal's watchdog generating logs on that end, and the infrastructure we run generates logs as well. But logs aren't monitoring. Logs are a horrible way to monitor: there's a computational overhead and a storage overhead. When you think about logging, how many of you are developers who have actually written log statements out to watchdog or something else? Yeah, most of you have. And you've written a log line and made it really long, because you wanted all that good info, right? That's the way we all write logs, because we want that useful info. But if you're trying to monitor off of a log, you're trying to pull a metric out of it, which means what? You're essentially running some sort of regex across a giant string to pull out that information, and then you're doing math on it: averaging it, summing it up. Now imagine that over a distributed system: you're aggregating huge amounts of text, which you have to store, so that's expensive, and you also have the computational overhead of pulling metrics out of your logs. So logs are horrible for monitoring, but logs are amazing for finding that additional context and helping you solve things. A coworker of mine likes to point out that logs are good for discovering the unknown unknowns. Monitoring's great: you set up dashboards for the things you know about. But when something fails that you don't know about, logs are usually really, really useful, because we all write verbose logs. So really you want log management tools, not log monitoring tools. Open source projects around this: the ELK stack, Elasticsearch, Logstash, and Kibana, is super popular, and Graylog is another one if you don't want to go down the ELK route. Popular services here would be Splunk, which I think most of us have heard of, Sumo Logic, which is a great one, and Logz.io and Logmatic.io, which are also fantastic.

Then there are the other tools that fall near this, that aren't monitoring but are often considered alongside it. On-call management tools, things that actually let you know when something goes wrong. Error tracking tools: a lot of times these gather up the errors from your logging tool, make them easily available, and start tracking your errors. And anomaly detection is often mentioned here too: essentially, how do we take our metrics and find interesting anomalies, or learn about things we may not be able to see in dashboards? Open source projects for these: Cabot and Openduty. Openduty is essentially an open source version of PagerDuty. These are really interesting, but I kind of wonder about the wisdom of it: the whole point of something like PagerDuty or VictorOps or ServiceNow, any of these tools that alert you, is to tell you that your system went down, and if you're hosting the alerting yourself, chances are it went down as well. So that's why you want to pay other people to do it. But again, PagerDuty, VictorOps, OpsGenie, ServiceNow: all great services for on-call management. Error tracking: Sentry is an open source project. They also do a paid, hosted version.
Airbrake and Squash.io. Squash.io is one I hadn't found out about until recently; it's from the people who built Square, the credit card processing iPad thing you find in stores. And then popular services: Sentry.io, Rollbar, Raygun, Bugsnag. All good ways to start tracking your errors and getting information out of them. Anomaly detection: there's even less here. There are some really cool projects, but they're tricky to set up. Skyline is the one Etsy came up with. It took their data out of, I believe it was, Graphite, and would try to find anomalies within it. It's no longer being maintained, though. EGADS is an interesting one; that's from Yahoo, and Luminol is from LinkedIn. EGADS, sorry, let me look at these notes, EGADS is actually really fantastic. It's written in Java, though, so it's kind of tricky to set up, and it's generic, so you have to write your own queries against things. So find a good data scientist and bug them if you actually want to run it. Luminol is interesting, but it's not particularly meant for real time, and really, if you're thinking about anomaly detection for your metrics, you want real time, to be able to understand what's going on, see those anomalies, and get alerts. Other services here: Amazon, if you're running in AWS, has Kinesis. Again, this is sort of similar to EGADS: you have to write your own queries for it, so again, find your nearest data scientist and bug them. Azure has machine learning, which they say tries to make this easier. I haven't played with their machine learning, so I can't say how good or bad it is.

But in short, just to wrap this up: really follow your application spectrum. Start to think of things less as buckets, or as the teams that are actually running them, and start to think of them more as a spectrum of what happens when your users make a request, so that you can start to see the gaps in what you need. And again, remember those four qualities of good metrics: is it well understood? Is it long lived? Is it tagged and filterable? And is it sufficiently granular? We do have, on our Datadog blog, some more information about how to collect the right data, if you like to read blog stuff. But at this point: questions. And I know there's at least one. There is a mic in the center of the room, and I think they want us to use that. And again, if you think of anything beyond this session, hit me up on Twitter; I'm @gitbisect. I'm not actually @eatings or @PrestonSo, but you should tweet something at them, if you haven't already, just to mess with them.

Testing, testing. All right, I think you had said that there weren't that many great tools for simulating users, is that right?

So, synthetic monitoring. Yeah, synthetic monitoring is essentially monitoring with computers: having computers pretend that they're real users, constantly running predefined tests to interact with your site, and generating metrics from that. You could potentially build something with Behat to run that: you would write a story as you would with Behat, and it would run and do things. But it's really hard within Behat to get really accurate timing measurements out of that.

So it would be as if, with code, the code was almost reviewing the monitoring itself and producing stats.
Yeah, I mean it's not reviewing the monitoring so much as, for example, if you were to try to do this in Behat, you would have some sort of Behat story like: as a user, when I go to the homepage and I click on this button to log in and I enter these login credentials, I should see my user profile page. An actual synthetic monitoring tool would measure the time between each of those steps: how long did it take for the site to load? Once the site loaded, how long did it take for the login button to appear? How long did it take, after I entered the login credentials, for things to happen? Behat largely wasn't designed for that. You could put in some timers, but because it wasn't designed for that, you'll get variability from run to run that may not be variability in your systems.

Right, so tools for simulating the actions are out there, but tools for doing something with that simulation and getting metrics from it aren't as developed yet.

Got it. Any other questions? No? I know that was kind of a barrage of "here's a lot of things and you should monitor them all." I'm sure all of you are using some of those monitoring tools, but you might be missing others, so consider what you're missing. Cool. Well, if there are no other questions, again, hit me up on Twitter or via email. Happy to answer your questions, and I'll hang out up front here for a while in case you do have any. Thanks for coming.

What's that? Yeah, that's my primary. I have a question: I mean, do you use the command? Oh, I don't very often, because... Well, me neither. Yeah. I do sometimes. I don't know when I could have used it, though. The very first time that I crashed something at Datadog... yeah. So I had committed, like, a minor documentation change and pushed it to... so I didn't crash the main Datadog, did I? No, yeah, I crashed staging, essentially. Okay. Staging is always tricky, because you have a bunch of people committing to staging, so you essentially have a race condition. He asked what was going in, had I committed it yet? It was just the documentation thing, so I'm like, if you're caught up on that, go ahead and merge it, because it shouldn't do anything. And so he clicks merge, and then CI, like, pushes staging forward, and suddenly it takes down staging. And I was like, wait, did he commit it? Is he on it? And then I looked at it, because it's, like, my commit. Our system flagged me as if I took things down, and then I got, like, heat: dude, totally not cool. You don't just, like, merge things and walk away. And I'm like, I didn't... I didn't even merge it. I walked away before I merged it.