Okay, I think we'll go ahead and get started. Just so you know, I don't expect this to run the whole hour, so I might be able to get out of here early, and I know it's late, so it's not too heavy. We're not going to be learning statistics here; I just want to show you some light concepts that hopefully you'll find useful.

I'm Mark Sonnabaum. I'm a performance engineer at Acquia. I'm not a statistician, and I'm not a data scientist. I'm just somebody who deals with data sets a lot, and the things I'm going to show you are generally pretty safe techniques for exploring data sets without getting into statistical methods that can be much more error prone and are best left to experts. Did somebody signal that they couldn't hear me? Is that the case? Okay, good.

So, a quick example of why this stuff is important. Has anyone heard of this data set before? Okay, great. This is Anscombe's quartet: four sets of X,Y pairs. Say we want to know how similar these four sets are. The first thing somebody might do is compute some summary statistics on them. Across the four data sets, X has the exact same mean, Y has a very, very similar mean, and the standard deviations are almost the same. Those summary statistics would suggest that the four data sets are almost identical. But when you plot them, that's what they look like. It's a really great example of how, if you don't look at your data and only use summary statistics, you may not actually know what's going on. You may be asking the wrong questions.

That brings up the topic of exploratory data analysis. The term was coined by John Tukey, who wrote a book on it in 1977. His main idea was that, at the time, data was being used to confirm hypotheses, and there was an issue with confirmation bias: you start with your hypothesis and then use the data to see if it passes or fails. His point was that you need to explore the data set to figure out whether you're even asking the right question, because once you explore it, you might realize there are interesting questions you can ask about the data that you wouldn't have known about otherwise.

Especially if you haven't dealt with any of this before: I'm guessing almost everyone here has graphed some data in their life, with something like gnuplot, or D3, or any of the JavaScript libraries, and it's very likely you were treating it basically as a way to present the data. But data visualization is much more than that. It should be part of your analysis process.

The first thing I usually do is figure out whether the data set is even correct. I've seen results multiple times from people who didn't do this check; they'd show me the results and say, this doesn't make sense, and when I'd go look at it: oh, this data is garbage. Just looking at the first couple of rows and at the number of rows, and making sure they're exactly what you expect, is a huge help in finding bugs. I always do this, and it's pretty rare that I wrote the data out the right way on the first try. I usually find something wrong with it.

So the basic process is: formulate a question you want to ask about the data, then make a very fast, basic plot just to make sure everything looks right, something like the quick check sketched below.
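A minimal sketch of that first sanity check (not from the talk's slides; the file and column names are hypothetical):

    requests <- read.csv("requests.csv")
    head(requests)        # do the first few rows look like what we expect?
    nrow(requests)        # is the row count what we expect?
    str(requests)         # did every column get parsed as the right type?
    plot(requests$time)   # a throwaway plot, just to eyeball it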
And then hopefully that will tell you something about the data just by looking at it, and then you start to refine your question, and you iterate through this process. You also need to ask yourself: can the question be answered by this data? Sometimes, unfortunately, you have the one data set and you find that it can't. At that point it's a good idea to stop, and either get more data or not report results from the limited data set you have.

Because this is an iterative process, it needs to be very fast. The process of, say, getting a CSV, pumping it into D3, and getting D3 exactly the way you like it is not fast enough to support an exploratory analysis workflow. This quote is a really good one (Hadley Wickham, who I'll introduce later), and the basic idea is that the bottleneck in this process is cognitive. So we need to optimize the process for that, not for, say, the presentation layer.

And that brings us to R. R is designed with this process in mind. It's interactive. You might ask yourself why you'd want to learn another language, because a lot of these things can be done in your existing programming language. The interactivity is a huge help. It also has very robust two-dimensional data structures. The two general-purpose languages for this are R and Python (not plain Python, but Python with SciPy and all the other pieces), and they have data frames, which I'll introduce later: basically two-dimensional, matrix-like structures. You can think of one as an in-memory version of a CSV. There's also built-in graphics output; you can almost think of it as a third stream alongside standard error and standard out.

And the community is amazing. The base R language is a little limited, but we have what has been called the Hadleyverse. Hadley Wickham is a data scientist who is, I think, at Rice University, and works for RStudio. He's written tons of packages that make the language infinitely better to use, and for most tasks I'm actually just using functions exported by his packages. And it's vibrant, too: when I started four years ago, half this stuff wasn't out. It's constantly getting better.

Just quickly, to cover installation: my favorite way is just brew install or brew cask install. I recently had somebody complain about how difficult it was to get it compiled with brew. Don't try to compile it, just install the package; they're good. And this is what you see out of the box: you get a console, and you get this other window that's the plot viewer. That can be useful, but you're probably going to get to the point quickly where you want to save what you have and work with actual files. As much as I'm not generally an IDE guy, I can say RStudio, a free IDE, is excellent and absolutely worth using. You type your files in there like a normal editor, and it has a built-in console. You can see it here: here's where the code goes, here's the console, here are some different ways to explore the data, and then all your plots get pushed over here. It's a very nice workflow; it works fine. I don't want to tour the whole R language, because there's a lot to show and it's kind of boring.
But as we go, I'll show you little bits so you can understand the examples. Just some very, very basics. The vector is the most primitive type. It's like a vector from other languages; the main idea is that it's multi-value, and every value in it has to be the same type. So there's an example of a vector of numbers and a vector of strings. The data frame is the structure you'll be using most of the time. It's a two-dimensional data structure like a matrix, but every column has to be of a single type, and all the columns have to be of equal length. You can internally visualize it as a CSV. There's a really simple example of how to make one and what it ends up looking like.

Most of the built-in functions in R are vectorized. So if I have the floor function and give it one number, it gives me one result; if I give it what's essentially an array of numbers, it applies to each element. That c function is just how you create vectors. Most of R assumes one-or-many. Actually, if you make a character value, say something equals the letter A, that's actually a vector of length one. It also has functions as values, which you'll probably recognize from JavaScript: you define a function by creating it and then assigning it. And you might have noticed it has a weird assignment operator. I'm told that back at Bell Labs (R is based on the S language, which came out of Bell Labs in 1976, I think) that arrow was apparently a single key on the keyboard, and they just kept it.

For reading data in, you want to get your data into a two-dimensional format. CSV or TSV works really well. I would avoid JSON, because JSON can be infinitely nested, and while there are a bunch of different libraries in R to read JSON, you end up with a pretty unwieldy object when you're done; you always have to flatten it. I'll put these slides online, and I don't expect you to copy down that gist, but I have a script I use that flattens any JSON file. It essentially takes all the nested keys and concatenates them with dots. I use that when I have JSON I need to pull in.

So here's a quick example of how to do this. Here's a CSV that I got from varnishstat, which obviously didn't come straight out of varnishstat; it doesn't come out as CSV. One thing you can do when you're starting out: don't focus too hard on making R read the data format you have. Write a script in whatever language you like to create a clean CSV file, and your life will be a lot easier. And there's the super, super basic version: reading the CSV file and making a plot. That's using the base plotting system in R, and it's the last time I'll show it. It's not very interesting, and there are better options now.

And just to quickly show it, R also has a pretty good package for reading logs, the kind where the format is delimited by spaces and also respects quotes. I have a sample access log like this; I can use the webreadr package, call read_clf, as in common log format, and I get a data frame back that parses everything into the right columns and assigns the right types, something like the sketch below.
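To make those pieces concrete, a small sketch (not the talk's exact slides; the log file name is hypothetical, and read_clf comes from the webreadr package):

    # a vector and a data frame, using the arrow assignment operator
    times  <- c(0.4, 1.2, 0.7)              # numeric vector
    agents <- c("nyc", "lon", "sfo")        # character vector
    checks <- data.frame(agent = agents, time = times)

    floor(c(1.7, 2.3, 9.9))                 # vectorized: returns 1 2 9

    # reading an access log in common log format
    library(webreadr)
    access <- read_clf("access.log")
    head(access)                            # one typed column per field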
Some quick terminology before we go forward. In this world you'll hear talk about the variables in a data set: a variable is essentially a column, and the rows are observations. And there are essentially two types of variables. One is continuous, which most likely means numeric or date; I think those are essentially the only two. Continuous means that on an axis it could go on forever, with no discrete steps in between. The other is categorical, or discrete; I think D3 calls it ordinal or something. That's essentially individual strings. I'll show examples of that, but it's good to recognize which one you have, because it affects what you're able to do with the data in terms of visualization.

So, some basic plots with those two ideas in mind, working with this data set here. It came from a monitoring system that checks a website from different locations. Say you want to look at one continuous variable; in this example, that would be the time a check took in milliseconds. The most basic visualization for that is a histogram. It's one of the most useful plots you can make, because it gives you a very clear picture of what the distribution of the data looks like. I also like to add that piece along the bottom, which is called a rug. It also shows the distribution of the data, and it can be useful if the bins in your histogram are wide. You'll notice I added a bin width: by default it just divides the range into 30 bins, but this is something you always want to play with. The bin width is basically how wide each bar is, and you can get different insights out of the data by messing with it, because it's very domain specific.

Now say you have two continuous variables; in this case, response time and dates. The simplest way to show that is a line graph. This may look a little boring. I know everyone now uses tools like Grafana, and the other one it's based on. What is it? Kibana, yes. And everyone immediately wants what's called an area graph, where you have the outline and everything under it is colored in. But that actually conveys no more visual information, so I'd encourage you to live with the boringness of these, because it doesn't matter.

Another way to do two continuous variables is a simple scatterplot. You'll see here I added alpha. Adding alpha to scatterplots is very useful because you get a better idea of the density: up here it's a little softer, so I know that's only a couple of points, whereas here it's essentially black, meaning there are tons of points. This is also an example of rendering output. R renders to PNG by default, I think, but you can also render to SVG. This is a plot you don't want to render as SVG, because you'd end up with a million little SVG objects in your browser, and you don't want that.

If you have one continuous and one categorical variable, you end up with, well, I don't know how to describe it; I think it's pretty obvious when you're looking at it. Here the categorical variable is the agent, so it splits the data up by agent. If I didn't add that, it would basically be one plot in the middle. This is a box plot; I'll go into what that is a little later. I also haven't introduced this plotting system at all: it's ggplot2, the one I like to use, and one of the most popular. The aes function stands for aesthetics, and the idea is that you're doing aesthetic mappings: I'm mapping agent to the x-axis, time to the y-axis, and the color aesthetic to agent. Once you've given it that semantic information, it knows how to make the plot, as in the sketch below.
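A rough sketch of those two plots with ggplot2 (the data frame and column names are hypothetical stand-ins for the monitoring data described above):

    library(ggplot2)

    # one continuous variable: histogram plus rug, with an explicit bin width
    ggplot(checks, aes(x = time)) +
      geom_histogram(binwidth = 50) +
      geom_rug()

    # one continuous, one categorical: a box plot per agent,
    # with the aesthetic mappings just described
    ggplot(checks, aes(x = agent, y = time, colour = agent)) +
      geom_boxplot()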
For two continuous variables and one categorical, you go back to essentially the line graph, but now you have a line per value of the categorical variable you told it about. This one is actually very hard to read, because it's all on top of itself. So another way to do it, which also works if you have two categorical variables and want to split things up two different ways, is what's called a facet. You can say: do that same thing, but split it out. Actually, my slide is wrong here; this is not two categoricals, it's still one categorical. It's just clearer because I'm using agent for the facet and also for the color. It's a nice way to quickly break the data out and get a graph per item in the list. I use this one very, very often.

There's a joke in the data science community that you spend most of your time just getting your data into the right format to work with, and it's very true. I think this is one of the things that trips people up a lot, because you will never get clean data sets. You will always have to do something with the data before you can start plotting or exploring it. When I started, I would do most of this in a Ruby script, get the data pristine, and import it, and if you're new, that still works great. But the tools for the more common manipulations are now in R, like the tidyr package, which is another Hadley Wickham package.

Here's an example of one of the most common things you have to do. In that varnishstat CSV we looked at before, you have time, client connections, client requests, cache hits. The values of all of those columns are essentially the same kind of thing (actually with the exception of cache hits), but they're numeric values. If you only ever wanted to graph, say, time against client connections, you could do that pretty easily. But if you want something like the graph I showed before, with a line each for client connections, client requests, and cache hits, then you have to rethink what the data set looks like in terms of what is a value and what is a variable. In this case, you could argue that client connections and client requests are not actually variables; they're values. So you can use the gather function to make wide data long: you turn the column names into a metric column, whatever the value was goes into a value column, and each value stays attached to the right metric, as sketched below.

It took me a while to figure this out when I first started with R, but it's the most common thing you'll have to do: you'll constantly be making wide data longer, because everyone produces wide data. It's easier to think about; it's how it looks in Excel. But as soon as you've reshaped it, you can do a really simple aesthetic mapping: map color to the metric, and now I have a line per metric.
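A sketch of that reshape with tidyr's gather (the data frame and column names are hypothetical stand-ins for the varnishstat-style data):

    library(tidyr)
    library(ggplot2)

    # wide: one column per metric; gather stacks them into metric/value pairs
    long <- gather(stats, key = metric, value = value,
                   client_connections, client_requests, cache_hits)

    # one line per metric, by mapping colour to metric
    ggplot(long, aes(x = time, y = value, colour = metric)) +
      geom_line()

    # or a facet per metric instead:
    # ggplot(long, aes(x = time, y = value)) + geom_line() + facet_wrap(~metric)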
Another simple example: performance tools especially might output their values with, say, milliseconds attached. That sucks, because I can't just plug that in; it won't get parsed as numeric, and it'll essentially end up as a categorical variable, when milliseconds is obviously continuous. For that I can use the extract function: give it a regex, and it splits the value out for me. I also retain the milliseconds unit, so if any rows have a different unit, I can treat them differently based on that. There are a bunch more verbs in that package, but those are things to look up when you have the problem.

Manipulation is another huge thing you'll have to do, and the dplyr package is the best option there. It actually works with more than just the data frame type; it works with databases as well. It has a very SQL-like set of verbs, so it can translate your code to SQL and run it against a database.

I'm also going to take the time to introduce this pipe operator. It's not unique to dplyr (it's supplied by another package), but it's used a lot in modern R, especially in the packages I'm going to show you, and here's why. Say you have this line of code: you're calling filter, then calling select on that, then calling head on that. That can get a little unwieldy. The pipe operator lets you chain them together, which is much easier in terms of how you think about the data flow, and much easier to edit: you can just comment out lines instead of messing with parentheses all over the place. All it means is: take what's before me in the pipe and use it as the first argument of the next function. If you're familiar with Lisp, Clojure has a threading operator for this; it's a common thing in functional programming.

So, filter: the absolute most common thing you'll do. With that data set I had before, with requests coming from different places, I can easily say: filter where agent equals this, and where the date-time is within this range, and then just continue to pipe that into ggplot and plot it. It's also a really great way to narrow down a data set. I was recently looking at a day of log data, I think it was about two million hits, and that's big enough that if I did the wrong thing in a plot, my machine would just sit there and hang. So I first do a bunch of filtering to make sure all of my visualization is right before I plug the entire data set in.

Then group by and summarise. You'll notice R was created at the University of Auckland, New Zealand, so the language itself uses British spellings. That's not a typo; you may have noticed it before with "colour". Group by you're familiar with from SQL, but here's an example. I'm not plotting here, just doing a basic summary. I say: filter out requests that don't have a time, because some of them didn't; group by the agent; and then summarise. The median and quantile in there are just base R functions, so I'm taking the median, the 95th percentile, and the 99th percentile of this column. Because I called group_by first, it knows to compute those only over the rows in each group, and the result is those metrics for each agent.

Mutate is the way to create new variables. An extremely simple example: say the time was in, I think, seconds, and I want to see it in milliseconds. I can say mutate: total time in milliseconds equals the total time column times 1,000. And now I have that new column that I can plot, as in the sketch below.
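A sketch of those dplyr verbs chained with the pipe (the data frame and column names are hypothetical, matching the monitoring example above):

    library(dplyr)

    checks %>%
      filter(!is.na(time)) %>%              # drop requests with no time
      group_by(agent) %>%                   # one group per agent
      summarise(median = median(time),
                p95    = quantile(time, 0.95),
                p99    = quantile(time, 0.99))

    # mutate: derive a new column from an existing one
    checks %>% mutate(time_ms = time * 1000)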
There are a bunch more examples of how you could use that, but it's a little boring. So, summary statistics. This is something I think everyone is pretty familiar with, but we all need to think about it a bit more, because it gets misused a lot. Mean and standard deviation: the mean is the average, for those of you who don't know. And you may have heard people say that reporting an average without a standard deviation is a bad idea, because it gives you no sense of how spread out the data set is. But even having the standard deviation is dangerous.

Here's a quick example of what we're usually talking about with these metrics. This is a normal distribution, and each of these bands is one standard deviation from the mean. The usual rule of thumb is that one standard deviation either side of the mean covers about 68% of the data, two standard deviations cover about 95%, and three cover about 99.7%. This example shows that holding for a normal distribution.

Now here's some latency data, recorded from a web service, so it's response times. The red dotted line in the middle is the average, and the blue lines are that many standard deviations away from the mean. In this example, I think it's pretty clear that those percentages I just quoted are complete bullshit here. They do not work at all. The reason is that this is a multimodal data set; it is not normally distributed. Normally distributed means bell curve, and response time data is never normal; it's often multimodal. This is an example of a multimodal data set: there's one giant mode here, and a little mode over here, which is why the mean ends up in the middle. So if you just have that one number, it doesn't really tell you much about the data set. Don't use these metrics unless you know for sure the data you're working with is normally distributed, and I can assure you, if you're dealing with response time data, it never will be.

One better option is called the five number summary. This also comes from John Tukey. It gives you the minimum and the maximum, the median (which is the 50th percentile), and the lower and upper quartiles, the 25th and 75th percentiles. It's also much easier to reason about: for the median, you can say 50% of the data set falls below that value; for the 75th percentile, 75% falls below it. And the summary function in R actually gives me the five number summary plus the mean. You can see here the mean is 25 while the median is 12; you can see just how worthless the mean is in that situation.
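A quick sketch of those two base R functions (times is a hypothetical vector of response times):

    times <- c(3, 5, 8, 12, 14, 20, 250)   # made-up latencies with a long tail
    fivenum(times)   # Tukey's five numbers: min, lower hinge, median,
                     # upper hinge, max
    summary(times)   # the same shape plus the mean, which the long tail
                     # drags well above the median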
That's also what the box plot shows; Tukey is also the inventor of the box plot. In a box plot, the line in the middle of the box is the median, the bottom of the box is the 25th percentile, and the top of the box is the 75th percentile. The whiskers extend out to some cutoff (there are different conventions for it; a common default is 1.5 times the interquartile range past the box), and then it still plots individual dots for the outliers, so you can see what the long tail looks like. I really, really like box plots for almost any data set, because they give me a very quick way to see what's going on and what the distribution looks like. I could do it with a histogram, but not as quickly, and I can't divide a histogram up by different categorical variables the same way. I'll show you that in a second.

If you've ever used ab, which I'm sure lots of people here have, here's a screenshot of its output. It gives you a bunch of different numbers. The minimum: totally useful. The maximum: totally useful. It also gives you mean and standard deviation. Don't ever look at those; just ignore them, pretend they're not there. The bottom part, where it shows you the percentiles, from the 50th percentile, to the 66th, all the way up to 100: that is the only data you need. It's the most useful data, and it's also much, much easier to reason about. You can say 99% of these requests were under 239 milliseconds, but 50% were under 35. Mean and standard deviation do not give you a useful way to describe this data set. And it's really unfortunate: most load testing tools will give you the average, hopefully among other things, but the treatment of this data is pretty lacking in most tooling.

To show another good way to look at latency data: I showed the box plot before, but a lot of times we want to see it over time. You can divide the data up into bins. I think for this one I said every 30 seconds, give me a box plot, and it's telling me quite a bit. I can see that the distribution is very, very tight, and then all of a sudden it goes way up. Something is happening there that I would have missed otherwise.

Another way to view distribution data over time is a heat map. This one's a little hard to see, but you'll see sysdig is doing something with this now, and DTrace had a bunch of heat map tooling. It's nice because time runs along the x-axis and response time is on the y-axis, so it's sort of like a moving histogram: the intensity of the white is how many data points fall within that range. At the very beginning it's really tight, it's all white there, but then these little gray splotches appear, and you can see some requests were falling in that range but not all of them; most requests were still down here. Which again shows that this data is multimodal. (Both of these views are sketched below.)

One of the reasons latency data looks like that is that there's usually a fast path and a slow path. If it's a Java app, the slow path is almost always garbage collection, so you'll see a second mode, because every time garbage collection runs, that's however many seconds added to the response time. For something like a Drupal site, it could be whatever interval you miss your cache on; every cache warm is another mode.
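Rough sketches of those two distribution-over-time views (the data frame and column names are hypothetical; both assume a POSIXct datetime column):

    library(ggplot2)

    # a box plot per 30-second bin
    checks$bin <- cut(checks$datetime, breaks = "30 sec")
    ggplot(checks, aes(x = bin, y = time)) +
      geom_boxplot()

    # a latency heat map: 2-d bins, intensity = requests per cell
    ggplot(checks, aes(x = datetime, y = time)) +
      geom_bin2d(bins = 60)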
In terms of the actual presentation side, once you've gone through this process and have a really good understanding of the data and want to show someone, there are also some really great tools. R Markdown is a very popular format, supported well in RStudio. Here's an example of it. It looks essentially like GitHub flavored Markdown, but it doesn't just embed the code: any time the code outputs something, whether to the console or as an image, it captures that output; for an image, it actually creates the image file and links it into the Markdown document. This presentation is actually all written in R Markdown. All the plots you saw before, I didn't place any of them by hand; I just put the R code in, rendered it, and they all show up. It's nice because I can write a report, put the plots in, and choose whether or not to show my code with each plot. It's usually a good idea to keep the code in there if you're trying to create reproducible research, where you want someone else to be able to run it and get the same results; but if I'm giving it to my manager, they don't want to see my code.

There's also one-click publishing to a site called RPubs, which is maintained by the company RStudio. That makes it even nicer: I can get the report where I want it, say publish to RPubs, and just give somebody a link and share it instantly. If you want, you can go to rpubs.com/msonnabaum and look at all of my stuff there. There are lots of R examples of how to do things.

And just to quickly show: all the plotting before was ggplot, and one thing you might be thinking as you look at it is that it's all static, and we now live in a world where we're web developers. We want everything to be dynamic; we want to be able to interact with it. That's valid. Being able to interact with a plot during the exploratory process can actually be important. There's a newer package called ggvis, which I think will eventually replace ggplot, but it's not quite there yet. Just to show a quick example (is that bigger?): this looks similar to the example before. I add a histogram layer, but then, instead of giving the bin width a fixed value, I can say input_slider. If I run this code, then in the viewer here, that's actually interactive: you can see what the histogram looks like as you change the bin width, and things change quite a bit. This is all HTML and CSS; it renders through a library that uses D3 on the front end, and I can pop that out and actually view it in my browser. (A sketch of the code is below.) It is pretty mature. The only thing it doesn't have is faceting, which is the only reason I haven't switched to it.

Yes, Daniel? Very, very good question. On the back end it uses a project called Shiny, which is an R web server technology. It spins up a web server in the background and handles the interaction between the front end and that server. That's very important, because with a data set like the one I mentioned before, a couple million lines, if you put that in the browser, your browser will crash. There's no possible way to do that. But this way, the data can just sit in memory on the R server, the front end makes requests to it, and it'll still be very fast. And similar to RPubs, there's shinyapps.io: once I make an interactive document, I can publish it there and people can interact with it.
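A minimal sketch of that interactive histogram with ggvis (the data frame and column name are hypothetical):

    library(ggvis)

    checks %>%
      ggvis(~time) %>%
      layer_histograms(width = input_slider(10, 1000, label = "bin width"))

Run from RStudio, this spins up Shiny behind the scenes and redraws the histogram as you drag the slider.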
If you want to learn a bit more about this, there are a bunch of books on the subject, but I'd actually suggest two that came out very recently. Roger Peng works at Johns Hopkins University in Biostatistics, and he runs the R Coursera courses and the data science specialization. He recently put two books on Leanpub, which is a site where you can pick what you want to pay. They're small, they're really to the point, and they're very easy to understand; they're not written from a perspective that assumes a lot of statistics background. So I'd suggest those; they're very good. And those are the packages. This will be online later if you want to look up the packages I showed. And that's it. I'd ask that if you can find this session node and do an evaluation, that would be awesome. And I'll take questions if anyone has any. Yes, please use the mic if you have questions.

I don't think the mic's on. Oh, there it is. So, are there any resources you could point us to for that basis in statistics? I was a lowly art major, and I taught myself engineering, and there's a huge hole.

Oh yeah, I'm right there with you. So, that's a big topic. None of the things I showed really requires much statistics knowledge. R has incredible support for modeling, like forecasting and linear regression, and I didn't show any of that. Some of it is simple; doing a linear regression is sort of simple. But that quickly gets into the realm where, if you don't have a pretty solid statistics background, you can come up with incorrect results and not know they're incorrect. It's really best left to people who know that stuff well. If you want to learn it better, the data science specialization on Coursera is very good, specifically the statistical inference course. It's also very hard; I got halfway through and had to take a break. And I found that while it was interesting and I wanted to brush up on those skills, it's not that useful for what I do. I'm not looking at a population of people and trying to find clusters and trends, doing k-means clustering, that kind of thing. I'm dealing with relatively small data sets that I can look at directly and make very simple inferences from; they're not so large that I need to build a model. So it may be the case for you that you don't need to go that far. But I'd encourage you to try it and see. Yes, go ahead.

I have a follow-up if nobody else does. Could you speak to your data collection process? You talked a fair bit about analyzing after the fact, but presuming you're not logging and sampling all of the metrics all of the time, I'm wondering what tools you use for that initial collection, to get the data into the thing you turn into the CSV. And also, is this usually something you do to diagnose a problem once you've actually observed something? Are you going back to a snapshot of an outage, where you'd already have to have that data in advance?

I'm not doing that. It's really difficult to get that kind of data. I'm showing everything as raw data here, with no aggregation, and it's very difficult to get that kind of data from production with zero aggregation. It's hard to store, and it's hard to get off the machine in an efficient way. So typically you're not going to have it unless you're Netflix and you have one-second resolution. Very uncommon, though.
Most of the data I'm looking at: well, it would be great if the server-level tools did this stuff by default and output a format I could easily ingest, but unfortunately that's not the case. This is usually me setting up an environment, trying to recreate a production event, or testing something that just needs to be benchmarked. I did this with a bunch of Drupal stuff. My favorite tool for monitoring and getting good data out is dstat. dstat is a Python application that wraps a bunch of other tools: it'll read /proc/vmstat, basically read /proc for all the things you want, and it can combine the output of vmstat, mpstat, iostat, netstat, all those tools that are very useful but that you can never see together. You can run dstat with all the flags you want, give it an output file, and the output file will be a CSV.

It's a bit of a pain in the ass, because like most tools that write CSV, it does something to fuck it up: the header isn't valid CSV, it spans rows, so the odd rows are essentially one CSV and the even rows are another. I have a function that fixes it, in a local R package that reads in dstat files; I'll probably publish it eventually. Before that, I would run mpstat or iostat, get the most parsable output I could from them, write a script to convert that to CSV, and import it.

Usually I'm collecting at one-second resolution. In production you'll probably have something like ten-second resolution, because you don't want collection constantly running on the machine and taking up CPU cycles. The problem is that there are a lot of production issues you will miss that way. Very short CPU bursts, for example: you'll miss them every time if you're only sampling every ten seconds.

About this process in general: I find performance work is this way, but especially this analysis process. As developers we want to automate everything, and you really, really need to resist automation, because if you automate things, you don't let the data surprise you. It's antithetical to the whole exploratory data analysis process. I see people do this all the time: they automate the process, get a number back, and say, here's the number, I trust it, but it looks really fishy. And when you actually dig into the data, it turns out the automation was making an assumption that isn't appropriate for that particular data set. So I write a lot of throwaway code. Most of my R scripts I don't even bother putting in source control, because I'm never going to use them again. That's why R and the libraries I showed are so great: I don't build up a body of code I need to save. It takes me about as long to rewrite most of that stuff as it would to go find where I did it last. Does that answer your question?

Yeah, absolutely. If nobody has another one, I have one last one, and I promise I'm done. (We can talk all day. We've done it many times.) Does Grafana have any part in this process? For anybody who doesn't know, it's a tool for building graphs, a lot of the time out of more real-timey, constant stuff, like Graphite or InfluxDB backends, where you feed metrics in all the time and then you can graph them.
I'm wondering if R is complementary to having that: if R would point you at the things you might try to create samples for and automate, if it reveals automation that would be useful, or if it's just sort of antithetical as well?

They're pretty complementary. Those tools, like all performance monitoring tools, just love the shit out of line graphs. Everything's a line graph, and the x-axis is always time; everyone wants to see the line going over time. The problem is that for a lot of data, that actually doesn't give you what you need. Heat maps and histograms are much more useful. I'm sure you've seen it in Grafana: a line constantly going up and down, and it's very difficult to recognize that it's actually a multimodal data set in some cases, right? Brendan Gregg at Netflix is really pushing performance companies to have more of those richer visualization types in their tools, and they'll hopefully end up in tools like Grafana. But right now, things like Grafana are very much tied to Graphite-style backends, which are time series databases. So I like to treat them as two completely separate things. I mean, the original, Kibana, is from Elasticsearch, and Elasticsearch can actually do a lot of these aggregations, so you could have really interesting visualizations come from it. But everyone loves line graphs, so it doesn't happen.

Hey, thanks for the talk, that was really cool. (Thank you.) So here's the question: this is great for exploring the data on your own, getting insights about what you see, and perhaps giving some feedback to your team. How about when you want to expose this data to other people, so they might be looking for different aspects? I know there's a slew of attempts in the Drupal community to create modules that play with data: charts and graphs, D3, others. If you have done such a thing, what is your favorite stack? Also, I know the DKAN guy over there has a nice visualization module, right?

For things like that, really all you need is a CSV file at a URL. I can't remember if I've ever done exactly that, but I think I probably have: as long as the CSV file is public, I can do the analysis in R Markdown. read.csv can actually take a URL instead of a local file, so it pulls the data set down and does the analysis, and I publish that and say, okay, you can see it's pulling from this URL, and here's all the code I used to generate it. You could copy the code, fork it, and make your own analysis, or you could just read the data set in directly and use whatever library or process you want. Python has IPython notebooks, which I think is called Jupyter now; it's not exactly like R Markdown, but there's lots and lots of overlap, and I hear really great things about that project.
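That sharing workflow is tiny in practice. A sketch (the URL is hypothetical):

    # anyone can reproduce the analysis straight from the public CSV
    checks <- read.csv("https://example.com/checks.csv")
    summary(checks)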
But if you're doing data analysis on anything sizable: I really don't want to use the word big, because no one here has big data; big data is terabytes, and none of us even have medium data. R does everything in memory, so if you have a data set that's maybe close to your memory size, where you might be hitting limits, or that's just big enough to be very slow in another programming language, you really want to be using R or Python, something that binds to C and Fortran and does matrix math efficiently. If you're getting the data and then doing the analysis in the browser and pushing it out to D3, it's likely the data set was pretty trivial to begin with, so a tool like this may be overkill. I personally find using D3 from scratch to be incredibly painful and not conducive to an iterative process at all. But if you really like that, you can totally do it. Anything else? All right, thank you.