So, actually, I wanted to say thanks to Steve for that talk, because, as you may have noticed, there's this other gentleman, Bryan Liles, and I'll start with a short anecdote of how we met. We were at a RailsConf, I think it was 2006, and I'm sitting outside hacking on a project, just sitting there, and I see this tall, bald black guy walk up to me and say, you must be Randall Thomas. And I said, you must be Bryan Liles. And that was the start of a friendship, and of cases of mistaken identity for the next six years, right? Be that as it may, it turns out that was okay, because we now work at a company called Thunderbolt Labs, which we founded about a year ago. And it turns out nobody can mistake us anymore, because we're almost always in the same place, which makes it a little easier. So thanks for being here. And this is an interactive session, so don't just sit there, because you're going to see a whole bunch of things that probably look like lies, so call me on it. And there are going to be trolls, lots and lots of trolls, so just be aware, right? This is a troll warning. Anybody ever hear this quote? "Lies, damned lies, and statistics." It's normally attributed to Mark Twain, but I think it was originally said by Benjamin Disraeli. He was the first Earl of Beaconsfield, and he had a whole bunch of pithy things to say about how the English parliament were basically a bunch of assholes. He was referring to one of the ways they were using statistics to bias things against the poor at the time. At least that's the anecdote; you never really know. Ever seen this joke? All right, so it's funny. I was having a conversation with somebody the other day, and they said, how come every time somebody tells you "don't panic," that seems like the only logical response? Have you ever noticed that? Car is on fire. Don't panic. You know what? It turns out your girlfriend's sleeping with your brother. Don't panic. Right?
Well, I've got something else that I actually want to panic about: the rise of the data scientist. Has anybody actually heard this phrase yet? It's a new job description. Raise your hand. You have? It's normally some guy who basically has poor social skills, writes a lot of Python, sits around and says, I know more math than you. This term incenses me. It's right up there with big data, as far as I'm concerned, as snake oil for the next century. This is bullshit. This is the worst sort of pure shit. And those are the people responsible for it. The enterprise right now is attempting to sell you a bill of goods that all this big data stuff is going to fix all your problems, the same way SOAP did. Remember, SOAP stands for, anybody know? Okay. Has anybody actually written anything with SOAP recently? Did anybody enjoy it? Did anybody want to actually kill the people who did WSDL? Fuck yes. So it was funny. SOAP started off really simple. And then Microsoft and IBM started selling tools for it. And it became really complicated. Coincidence? I think not. So right now, I want to show you a picture. Anybody know who these guys are? Yeah. Those are scientists. How do I know? Because they put one of these on Mars. Well, I mean, you and your conspiracy theories, man. I've been trying to disprove the moon landings for years. Okay, the moon landings were fake, but this was real. Because you know why? Shit was on YouTube. Yeah, exactly. I mean, and we all know Twitter's true. I mean, the Titanic was real. You guys know about that, right? Okay. This is the world you live in. Don't just sit around and look to your left and right and think that this is actually everybody in America. It's not. So those are scientists, right? Anybody know who these guys are? I'd be surprised if you did, unless you have a background. Anybody have a background in theoretical physics?
So these were the guys who, in 1964, basically unfucked physics with the Higgs boson and unified several parts of electroweak theory. They wrote the paper in 1964. We only proved that they were right this year, right? They fixed physics. And actually, to be quite honest with you, they make things that look more like the Death Star than the Death Star does. So these are scientists, right? Physics goes back centuries. This is the first image that came up for me when I searched for "data scientist." Ironically, and I wasn't joking, it came from an eWEEK article called "The Rise of the Data Scientist." I didn't even know there was an article like that, and I'm like, oh my god, I'm destroying the world with irony here. That's the picture that comes up. Would anybody write code with this guy? Because he looks like an architect to me. Ah, some of you have worked with architects. So I got this ERD diagram. Anyway, so what, right? The problem we have, and are facing, is that about 50 years ago, computing became reasonable, right? And if you track computing, we created something called computer science. And then we started teaching people computer science and big-O notation. And we started doing things like making sure people took compiler theory classes, and calculus, and assembly language, and all these other things, and we created computer science curriculums. How many people here have actually programmed with somebody who was fresh out of a computer science degree? How many people have actually tried to choke their pair as he tells you that this is an O(n log n) thing in Ruby, and you should really be using a linked list, and we could really write this in C or C++ faster? Bryan. People who pair with Bryan have to sign a waiver of abuse.
So the thing is, if you think about it, we're now just getting to the point, with craftsmanship, where we're starting to unfuck 50 bad years of what happened when we dragged computers into the scientific era, right? It turned into this pseudo-scientific discipline where anybody could be a computer scientist, anybody could write software. And I think it's not unfair to say that the Ruby community, almost more than any other, has embraced software craftsmanship as something that's more than just knowledge and engineering. Honestly, physics took a long time to get to the point of being a science. Computing is only 50 years old. So how did something that was invented 18 months ago, by mixing what we do with statistics, become a science? Right? And considering that more and more things are going to be statistically driven, what's going to happen when your job, your ability to monitor things, all these other things, are driven by some guy who's a "data scientist," right? What's going to happen when we get the same sort of pseudo-scientific reduction of what we do with statistics, because we call it science? So the whole point of all of this is that you guys actually care about what you do, and more and more of your job is going to be driven by statistics. And if you don't learn it, and you don't understand at least the basics, you're going to be looking at some guy who's probably got an Excel spreadsheet and says, no, no, no, we need 400 EC2 instances. And you're like, I only need 16 megs of RAM and about two megs of disk space, right? But he's got an Excel spreadsheet that tells you what you need. So, to drive home my point... somebody's chuckling. Anybody? Anyone? Oh, yeah. These guys. These assholes. Knight Capital lost $400 million in 45 minutes. And by the way, the hemorrhaging was only stopped when they cut the feed to their automated trading systems, right? I can imagine somebody running through going, cut the power, cut the power.
It's like a bad scene from a movie, right? That's roughly $8.8 million a minute. Now, I don't know about you; that's a little less than I make a minute. Thanks, Jim, for my speaking fees. But $8.8 million a minute. You know what the cause was? Test code in production. How the fuck does that happen? Not only was it test code in production; on top of their deployment snafu, the algorithms that were calculating the fluctuations of stock prices weren't stable. It doesn't really matter exactly how they did it, but they used a statistical method to try and predict the stock market, and it didn't converge. So it just floundered around, selling and buying. Turns out... anybody here work for DRW Trading? Anybody? No? Oh, you're laughing. You know why, right? Yeah, exactly. I'm sure there were more bottles of Macallan 18-year-old whisky bought at DRW Trading that day than anyplace else. A whole bunch of other people took advantage of their fuck-up and made money. The point of this is that statistics is no longer something virtual. It's something that actually affects your daily life. Some of you guys' 401(k)s lost 10% because some guy deployed test code in production. So think about that. So I'm here to change your mind about the way you think about statistics, and the way you should actually be using statistics in your daily life. And that's why this is called Taking Back Analytics, TBA. I'm going to give you a crash course in statistics here. Well, I might not finish it in 20 minutes, but we'll see. I'm going to start with lies. We're going to go on to damned lies. And then we're going to finish with actual statistics. So let's start with the lies. There are two types of things in the world: continuous and discrete. Continuous things are like floating point numbers, and discrete things are like integers. This is actually completely false, but it works for our purposes. You guys get it.
It's like floats and ints. We get that. We know the difference. You sure? I've read some of your code. Two types of statistics, right? We've got predictive and descriptive. I guess this is kind of self-explanatory, yes? OK. We're going to get rid of the predictive, because that shit's hard. We're going to just stick with the descriptive. And it is really hard; doing predictive analytics is incredibly hard to get right. But really, there's only one thing that we want to know, and that's called a distribution. And no, this isn't talking about your 401(k) this time. We're talking about something that describes an underlying data set, right? And we're going to pretend that we can get any distribution, because we're going to lean on the law of large numbers. Has anybody ever heard of this? OK, so what's the law of large numbers? Somebody? Anybody? Oh, Jesus Christ. I'm going to use that next time I get a DUI, right? Sorry, officer, I've heard of it. OK, the law of large numbers says this: if you do something a whole lot of times, over and over again, pretty much what you expect to happen will happen, right? Say you take a gun, you've been caught in the jungles of Vietnam, you put one bullet in one of six chambers. What do you think happens by the seventh time you pull the trigger? Nothing good for you, because there are only six chambers, right? Essentially, if you do something over an infinite amount of time, everything ends up looking like that normal bell curve we're used to. And that's because of the law of large numbers. We can pretty much say that, on an infinite scale, most things behave normally. That's also a simplifying lie, but that's OK, right? The reason we want to say this is that if we have a normal curve, and we have normally distributed data, then we can be completely lazy about the analysis we have to do on said data. Anybody remember this? Anybody ever get a C because of this? Anybody ever get a B because of this?
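The "do it a whole lot of times" idea is easy to see in a few lines of Ruby. This is a hypothetical sketch, not the talk's demo code; the die-roll setup and the `sample_mean` helper are invented for illustration. A fair die has a true mean of 3.5, and the sample mean drifts toward it as the sample grows:

```ruby
# Law of large numbers, sketched: average a growing pile of die rolls
# and watch the sample mean settle toward the true mean, 3.5.
def sample_mean(n, rng)
  Array.new(n) { rng.rand(1..6) }.sum.to_f / n
end

rng = Random.new(42) # fixed seed so the run is reproducible
[100, 10_000, 1_000_000].each do |n|
  puts format("n = %9d  mean = %.3f", n, sample_mean(n, rng))
end
```

Run it and the printed means wobble at n = 100 and hug 3.5 by n = 1,000,000; that convergence is the whole trick.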
I remember in my stochastic processes class, the A curve was at 42. I'd never been so happy to have a 28 in my life. So yeah, the bell curve, right? Strangely enough, this almost falls into the same conversation as privilege, right? We talk about the distribution around the mean. But the whole point of this bell curve is that if you have normal data, and you've all probably heard this, 95% of all the data is within two standard deviations of the mean. You're going to hear that a lot. Essentially, people are trying to find a distribution so that they can draw inferences from it. Because even though I lied to you about there being two types of things, continuous and discrete, if you have an infinite set of things, and you can draw a distribution over it, you can start inferring things about the data underneath it, even if it's a set that can't be counted. Which is kind of cool, right? It's like, if you have an infinite number of movies, I can tell you which ones suck. Michael Bay is the key. So OK, now we're on to the damned lies. There are really only five things you need to understand any data set, right? The minimum; we all get what this is, yes? This is the amount of effort you're putting into responding to me right now. Come on, people. OK, the maximum, right? Yay, thank you. There was much rejoicing. The mean; most folks call it an average, but stats guys like to be all fancy and put Greek letters on shit, so we say "the mean." It's an average, right? Statisticians actually separate a mean from an average in other ways, but for our purposes they're the same thing. And the spread, right? This is a measure of variance. And variance is really important. All these concepts, when you start talking about statistics and numbers and data sets, have an underlying concept that maps to reality. So for instance, you ever play a game of darts? You ever have a couple of pints while you play a game of darts?
Have you ever seen how your groupings in cricket at the beginning of a game of darts are way different than after three pints? Better? Jesus Christ. Maybe I should switch to whiskey. Yeah, well, OK, for most people, or at least for me, who actually gets worse; if you guys get better, I don't know how that works; there's variance. So this is variance amongst this population right here: I get worse, they get better. The mean is, we're probably at just about the same, right? But it's important to understand this, because it tells you something about your data. And the count, right? This is the one that's kind of thrown on top, because normally when you refer to variance, you actually have to split it in two, the upper variance and the lower variance, above and below the mean. We ignore this on a bell curve, because we say it's the same on either side. But in reality, most things aren't as neat as a bell curve. So we're just going to ask: how many of these things do you have? Because if I've got one thing, it's different to draw a conclusion from one sample versus actually having an infinite set, or at least a lot of samples, right? So now we come to the idea of summary statistics. You guys following this so far? OK. This is one of the central concepts of statistics: we don't know shit about shit. We can't actually measure it. We can't find it. We can't count it. But we can observe, do something, even an experiment, and then draw some conclusions. This is what it's all about. But we're not going to necessarily go through all of that. What we are going to do is try to give you a statistics black belt. And your black belt is going to be more of a sweep-the-leg Johnny sort of situation here, right? Because honestly, I started getting into statistics when I started working on proteomics data for a client of mine, and all the data wouldn't fit into the SQL server.
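Those five numbers (minimum, maximum, mean, spread, count) fit in one small function. This is my sketch, not anything from the talk's demo; the `five_numbers` helper and the dart scores are invented, and I'm using population variance as the "spread":

```ruby
# The five descriptive numbers from the talk, computed by hand:
# min, max, mean, spread (variance), and count.
def five_numbers(data)
  n    = data.size
  mean = data.sum.to_f / n
  var  = data.sum { |x| (x - mean)**2 } / n # population variance = the spread
  { min: data.min, max: data.max, mean: mean, variance: var, count: n }
end

before_pints = [12, 15, 20, 7, 18, 20, 3, 16] # invented dart scores
p five_numbers(before_pints)
```

Run the same function on your after-three-pints scores and compare the two variances; that difference is exactly the "my grouping got worse" story in numbers.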
And the biggest SQL server we could get at the time, and this is going to tell you how old I am, had 32 gigs of RAM and a dual-core processor. SSDs didn't exist. So we basically had a data set that wouldn't fit into one physical server, and we couldn't afford SQL server clustering. So I said, instead of working on all the data, why don't we just sample it, take a bit of it, and see if we can draw some conclusions? That led me down a rat hole of five years of trying to figure out what the hell most statistics people are talking about. I don't know if anybody's actually done any research in those papers. Anybody read any statistics manuals or books? They're pretty much written as if you have a PhD in statistics already, so they're no good. So we're going to give you the Karate Kid crane kick of statistics. You ready for it? Do I need to...? OK. The only thing you need is the histogram. Has anybody here actually seen one of my other talks on statistics? OK, so I'll fill you in. Normally I go for this esoteric bullshit like support vector machines, Bayesian filtering, and some fancy stuff. It's like, ooh. What you don't see is the pain and agony that I've spent trying to get those things to work, where for six months it's like, my support vector machine is broken; it's telling me I like Bambi; I don't understand why. You don't see the Bayesian filter when it starts suggesting porn after you've bought a couple bottles of wine. I don't know about the underlying data set, but that actually did happen once. So those advanced techniques, when you start applying them, generally speaking have two problems. One, you're uncertain whether or not they're actually working, because that shit's esoteric. It's like 66 pages in somebody's PhD thesis. And I don't speak Greek.
And two, they are sometimes subject to very subtle flaws that you can't detect until, well, the code's in production, right? So in times like these, it's like a zombie attack: you go back to basics. And oftentimes, when I look at people who come to Thunderbolt Labs with big data projects, I ask them about the basic underlying stuff of their data, like the histogram and the distribution. And they say, what's that? I was just told I have a big data problem. So your karate kick is the histogram. Your backup on this, of course, is the box plot. Now, the box plot is a visualization that gives you an understanding of the variance and distribution of your data, because sometimes you're fighting, somebody kicks you in the back of the leg, and you've got to get back up, right? This is your backup, your next-to-last tool of resort. This is a box plot, right? You guys seen these before? Yeah, actually, this is generated with R, and we're going to show you some of the code that generated this earlier, including generating the fake data sets. Ooh, right? So let's try and show you some of the code that does this. You actually got that? I'm impressed, guys. Nobody ever gets that joke. OK, let's see if we can actually do the same live. So we're going to fire up an ESS-mode R server. We'll see if this works. It was actually killing the processor a little earlier, so let's see. Because this is why you should never code live, right? OK, so what we're going to do is actually show you some of the data. And we'll show you the ones we cheated on and made beforehand. So this is a histogram that shows a fake data set with DB memory, right? I'm going to show you why we're interested in something like this. I always forget; I moved to Emacs from Vim a while ago, so I'm still a little slow.
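A box plot is really just five quantiles drawn as a picture. Here's a minimal Ruby sketch of the arithmetic behind one, assuming simple linear interpolation between sorted points (R's default quantile algorithm differs slightly at the edges); the `quantile` helper and the sample response times are invented for the example:

```ruby
# The numbers behind a box plot: quartiles and the interquartile range,
# via linear interpolation over a sorted sample.
def quantile(sorted, q)
  pos  = (sorted.size - 1) * q      # fractional position in the sorted data
  lo   = pos.floor
  frac = pos - lo
  hi   = [lo + 1, sorted.size - 1].min
  sorted[lo] * (1 - frac) + sorted[hi] * frac
end

times = [3, 7, 8, 5, 12, 14, 21, 13, 18].sort # invented response times, ms
q1, med, q3 = [0.25, 0.5, 0.75].map { |q| quantile(times, q) }
puts "Q1=#{q1} median=#{med} Q3=#{q3} IQR=#{q3 - q1}"
```

The box is Q1 to Q3, the line through it is the median, and the whiskers hang off the IQR; once you can compute these three numbers, you can read any box plot.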
What we're going to do here is go through and create a fake data set. We're going to simulate something like a data center that has a whole bunch of servers in it. Anybody here do DevOps stuff, or actually have to work with servers on a regular basis? So you probably aggregate logs, right? And I don't know if you've ever had this problem, but one of the things you're never sure about, if you have something like a database cluster, is: is the response time of that server out of band with the other servers in the cluster or in your group? Figuring that out isn't always easy, right? Sometimes you have one that just dies, but other times you're like, is it slow? Is it something I need to deal with? And as you start dealing with larger and larger clusters of systems, it really becomes difficult to understand what's going on. So what we're doing here is, these lines towards the top. Can you guys see that? Is that big enough? OK. In these first couple of lines, we've just set a server count. And one of the things you can do with distributions is that if you know, or think, the data is distributed in some way, you can create a simulation of that data by drawing from the distribution. Bryan alluded to this earlier with a vosalius method. The reason you would do something like this is that if I say errors, for instance; errors, generally speaking, are normally distributed in a system, with some exceptions. So I can say, OK, I can model the errors or times or something else by saying it's normally distributed, and create a fake data set that models a system. So for instance, if I know things like server times or mean response times, I could create a simulation of what my server environment looks like by running some R code, which is exactly what this stuff does, right?
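The demo does this with R's `rnorm`. A hedged Ruby equivalent, for anyone who wants the same trick without R: draw fake "response times" from a normal distribution using the Box-Muller transform. Everything here is invented for illustration (the `rnorm` name mimics R's; the 120 ms mean and 15 ms spread are made up), and it's a sketch, not the talk's code:

```ruby
# Draw n samples from a normal distribution via the Box-Muller transform,
# mimicking R's rnorm(n, mean, sd).
def rnorm(n, mean:, sd:, rng: Random.new)
  Array.new(n) do
    u1, u2 = rng.rand, rng.rand
    # 1 - u1 keeps the log argument in (0, 1], avoiding log(0)
    z = Math.sqrt(-2 * Math.log(1 - u1)) * Math.cos(2 * Math::PI * u2)
    mean + sd * z
  end
end

# Simulate response times for a fleet: mean 120 ms, spread 15 ms (invented)
times = rnorm(1_000, mean: 120.0, sd: 15.0, rng: Random.new(7))
puts format("simulated mean response: %.1f ms", times.sum / times.size)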
And damn, I've been fingers. So let's see if it actually evolves correctly. It's not coming up with the evaluation. Hold on, let's see what's wrong with this thing. Do I? I shouldn't have. I do. Thanks. Thank you. So let's start a bar again. I should do it. Here we go. Yeah, thanks. So I went so smooth when I did it earlier. This is why I should have actually done the Ryan Davis and actually just recorded it as a video. OK, here we go. So what we did is we've actually created some, oh, thanks. I can't spell either apparently. OK, so now it's actually up and running. So these are actually, sorry for that, these are actually what you see is the min interquartile rain, the min median mean third quartile, and max. These are the five summary statistics that we were saying that can pretty much describe any data set, right? So if you want to actually continue this to the logical conclusion, what we'll do is we'll actually just run through everything and show you some of the stuff that comes up. So now what we've done is we've actually created using this, we've actually created a fake data set that actually looks like some server data, right? So the whole point is we've created some, if you look down here, we've created something where we separated a cluster of 1,000 databases with 500 hostgres servers and 500 ysql servers, right? And we've saved some of them patched. We've actually done some crates per second, and we've actually created something about the memory utilization of all that. So you could imagine that if this were representative of an actual server log, you could go through and then create a big, huge window. This is actually, you know what? Everybody says, don't live code. So the actual image we're looking for, this one. So this actually should be the original boxplot that's created by that code. I'm not quite sure why the PDF render isn't working. 
So basically, the whole point is with roughly, without anything else, you could actually connect this up to a server log. You could actually troll files with about two lines of code in R. You can actually create an entire visualization of what's going on in your servers. So that's actually, one of the things to remember about boxplots, though, is that sometimes they can be visually incorrect depending on the spacing and binning of what's going on in the code. So you might actually want to overplot the images of to see the actual points themselves. So let's see if it actually renders this time. Well, that sucked to ask it. I suppose I'll have to go back and fix the code. So essentially, this is the one we were expecting to see. So this actually gives you a sense of understanding of exactly how far the real data points are, for instance, for your servers versus your memory. And this would be something like you could do to actually figure out other trends inside the usage of memory inside your server. So the last one thing is, I know that this isn't actually going to end up working correctly. We'll just show you the actual histogram, which is originally rendered. This one simulates actually checking the memory utilization of multiple SQL servers, or database servers, between 0 and 64 megs. And this is an actual histogram that runs over 1,000 servers. So for instance, if you wanted to understand which servers were possibly in a group or needed to actually have additional information or even some attention, by simply plotting a histogram, you can say, well, it turns out that the majority of the servers are somewhere between 20 and 40 megs of utilization. So depending on what you were doing, let's say these were Mongrel's, Thin's, something else, you could really easily actually figure out where you need to be putting your attention for anything from tuning to setting the God or Monit settings. So I suppose I should have actually had, let's say, the broken code, right? 
So what's the point of all this? The point is, essentially, at some point in your career, you're going to be faced with somebody who's actually going to be arguing with you over something, or you're going to find a need where direct tabulation of results or information is actually not capable of working. And when you run into that problem, the first tool you're going to actually turn to should be something that's statistically based. And one of the things about the Ruby community is generally speaking, we're pretty open-minded about embracing tools which can give us something or an edge for specific types of work. We work in JavaScript. We work sometimes in languages like Clojure. We work in Java with JRui. We work in Maglev. So don't necessarily always think of a computer language as your first tool. Sometimes it's actually something more fundamental, something more underlying. And that's it.