All right, so I'm presenting today on enough statistics so that Zed won't yell at you. My name is Devlin Daly. I actually contacted Zed about this talk and asked him, if I present this, is it really going to be enough so that you won't yell at somebody? And he said no, probably not. But you can try, so we're going to try it today and see what we can do.

Some caveats: I am not a statistician at all. I'm actually into digital identity, so if you want to talk about OpenID, OAuth, SRP, or permissions later on, I'm happy to talk about that. But I've taken a few classes on statistics, and Pat roped me into this, so it's all Pat's fault. What I'm trying to do today is familiarize you with some of the vocabulary of statistics and some of the basic concepts.

The basic reason statistics came about is models. Models should be pretty familiar to us, because software itself is a model. There are models of how the planets orbit the sun in elliptical orbits. They're mathematical models, so they're not actually real life. That doesn't mean they're not useful; sometimes they can be very useful. Statistics came about because people had these equations describing the motion of the planets, but the data didn't fit the model perfectly. There was always an error term; it was always off a little bit. Statistics is actually the study of that error of a model.

For instance: what is the temperature of the human body? It isn't just one number, because the temperature varies. We say it's about 98.6 degrees Fahrenheit, but really it's distributed. This is the standard normal distribution we're all used to, the bell curve. Can you see my pointer? Good. The 98.6 would be right here at the average, the center of the distribution. If we measure it, it's going to be a little bit more or a little bit less all the time; there's a little randomness all the time. With the normal distribution, the area underneath the curve tells us how much of the data falls how close to the mean. These are standard deviations: within one standard deviation of the mean, from minus one to plus one, we expect about 68% of the data to fall. If we extend that to two standard deviations, we expect to see about 95% of the data in that range. That doesn't mean all the data will be there, but on average about 95% of it will.

So that's the normal distribution. When we go to measure something, it's not always a perfect normal distribution, but it is a distribution. Distributions are characterized by where the middle is, which is usually the average or mean or some other measure of the middle, and by how spread out they are; the standard deviation is the spread. To explain the interplay between the two a bit more, here are two different samples, the blue and the orange. These two distributions have the exact same average, except one varies quite a bit more: there's a lot more spread in the data values in the orange.
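If you want to check those 68% and 95% numbers yourself, here's a quick sketch in R. The coverage fractions come straight from the normal distribution; the 0.7-degree spread for body temperature is a made-up number just for illustration, not something from the talk.

    # Fraction of a normal distribution within 1 and 2 standard deviations of the mean
    pnorm(1) - pnorm(-1)   # about 0.68
    pnorm(2) - pnorm(-2)   # about 0.95

    # Simulated body temperatures: mean 98.6 F, spread of 0.7 F (an assumed value)
    temps <- rnorm(10000, mean = 98.6, sd = 0.7)
    mean(temps)
    sd(temps)
    mean(abs(temps - 98.6) <= 0.7)   # roughly 68% fall within one standard deviation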
Back to that blue-and-orange graph: if this were a web server and we were only looking at the averages, we'd say the performance is the same, because they have the exact same average. But in one of them the measurements are varying wildly, so we know something is seriously wrong with our web server and we need to investigate it.

Coming back to finding the temperature of the human body: there isn't a single temperature, it's a distribution, and to find the true distribution, or the true average, we would actually have to measure everyone in the world. That's what we call the population. If we're talking about the whole human race, we'd have to measure everyone on Earth; if we were looking for the average of everyone in this room, we'd have to measure everyone in the room. For the whole world, that's not even possible. Or, what's the average weight of an elephant? There's no way you can actually weigh every elephant in the world. So since you can't measure the population directly or exhaustively, you have to sample it. A sample is just a subset of the population that we use to estimate the population. The population is the thing we actually want to measure, and the sample is what we can feasibly measure.

But the sample needs to be representative of the population, or we're going to get completely bogus data. For instance, if we were looking at the average height of people and the sample was my family: I'm short for my family, and I'm six-two. We would conclude that everyone in the world is around six-six, which is totally bogus. So we need to make sure our sample is representative if we're going to infer anything back onto the population. I'm going to get into statistics for comparisons in a minute, but before that, just know that sampling is a key part of statistics, that we need good samples, and that that is how we manufacture good data. It's the only way we know what's actually going on. One run of any benchmark or test is not a distribution; that's a single data point. We need multiple samples so we can see the variance and identify the distribution.

All right. I work for Phil Windley. I think a lot of you may know him by reputation. One thing you might not know about him is that he loves Diet Coke. He runs a conference, and he makes sure there's always five times as much Diet Coke as any other beverage, because he loves Diet Coke and he's not going to run out; he has it for breakfast every day. So let's say we want to know: can he really tell the difference between Diet Coke and Coca-Cola Zero? How can we design an experiment to find that out?

If we think about it, we could give him one of each. We let him taste a cup with no label, of course, and he picks either Diet Coke or the other one. And then the second one has to be the opposite. But he has a 50-50 shot of just guessing that right, so that's not very good. With just two cups we don't know if he's guessing or if he actually can tell the difference. Okay, so we could give him two of each.
So he has a 50-50 shot on the first pair and a 50-50 shot on the second pair. To get it absolutely right by luck, he has a 25% chance of guessing correctly even if he can't tell the difference. So it looks like we're onto something: if we give him enough cups, he's not very likely to just get lucky. There's a slight problem with this approach, though: what if he really can tell and he just makes a mistake? It's a model, and he's going to make an error, either a human error or he meant to say one and said the other. So what we do instead is get a big sample size and randomize which one he drinks each time. Then we ask: if he can't tell the difference between the two and he's just guessing, what would we expect to see? We'd expect that, on average, he's going to get about half right just by chance, just by luck. And if he can tell the difference, we'd expect that proportion to be much higher than 50%.

This is called a Z test. You can go look that up if you want the actual mathematical formula and how you'd execute it, but the point is that we're comparing two things: the proportion we'd expect if he were just guessing versus the proportion he actually gets right.

Statistics also has a logical framework for testing; statistical tests have a structure to them. These tests typically compare two models, and you start with a hypothesis that you assume to be true. We call this the null hypothesis. In this case the null hypothesis would be: he can't tell the difference, so he's going to get about 50% right. Then we look to see whether the data supports an alternative hypothesis, which in this case is: he can tell the difference between the two. A thing to realize is that statistics doesn't prove anything, and it doesn't tell you which of these models is right. It just tells you which one is better supported by the data you've gathered. And it gets a little more subtle than that: with the null hypothesis you never actually accept one of the models. You either reject it or you fail to reject it, because we don't know whether there's a different model that would be better than both of them. So this is just one of the tools; we need to apply it at the right time and only in the right circumstances.

An example of this is A/B testing. Let's say I have a site and I sell stuff. Visitors come to the site, and a certain percentage of those visitors become customers. That's a proportion; we call it the conversion rate of a website or an advertising campaign. Let's say I come up with a sweet new website, the new hotness. It's just great, and I know it's going to be good, but how do I actually know it's any better than my old site? How would we go about testing that? Well, the conversion rate is a proportion, so it's exactly the same as the Diet Coke example. We need a representative sample, so we make sure it's representative by picking at random whether each visitor gets sent to the new site or the old site. Random, because that's the best we can do to get a representative sample.
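As a quick aside on the Diet Coke test, here's roughly how the guessing math and the proportion test look in R. The cup counts are made up for illustration; the talk doesn't give actual numbers.

    # Chance of guessing right by pure luck: one pair is 1/2, two pairs is 1/4, ten pairs is tiny
    0.5^2
    0.5^10

    # Hypothetical taste test: say he names 42 cups correctly out of 50 randomized cups.
    # Null hypothesis: he's just guessing, so his true hit rate is 0.5.
    prop.test(x = 42, n = 50, p = 0.5, alternative = "greater")
    # A tiny p-value means "if he were guessing, a result this good would be very unlikely,"
    # so we'd reject the null hypothesis that he can't tell the difference.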
Okay, so then we send a portion of the traffic to the new site and compare the conversion rates. Now, there are a few caveats to this. I had a friend whose team rolled out a new website and did exactly this: they ran an ad, sent a certain percentage of the traffic over to the new site, and found that the new site wasn't doing as well; it was actually doing worse. So they went looking, and they discovered the new site was broken for IE6. They were using JavaScript to put things in your shopping cart, and if you were using IE6 you actually could not purchase anything on the site. That's really bad, right? So then the question is, does that account for why it's worse? And you can answer that because you're working with proportions: you can take the IE6 users out of the sample, compare the old site against the new site's proportion minus the IE6 users, and see whether it was the design as a whole or just that one error that you quickly fixed. So that's the Z test, a proportion test.

All right, now let's say we want to sample for a benchmark instead. We have a web application, and the thing we actually want to measure is how many requests per second it can handle. A good tool for this is httperf: it generates a test load against a web application, sending requests one right after another as fast as it possibly can. Now we need to sample out of that. What httperf does is, every five seconds it takes a given second and counts how many requests actually arrived during that second. In this picture, all the X's are requests; the red X right here is the start of a sampled second, and we count how many requests we got during that second. The next second we're not sampling. So in this test we might push through 10,000 web requests, which takes about 30 seconds, but we're only going to pull maybe four samples out of that population.

And of course our favorite web server for Ruby applications is Mongrel. It's a little small up on the slide, but the question is: how many Mongrels do we need? Mongrels on their own are kind of a free-for-all; they're not lined up, they're just going as fast as they can go, so what we really want is a whole pack of Mongrels pulling together.

So here's what we do: we measure the performance in requests per second, then we make a change, like adding caching to the application, and we ask: now that we've made that change, did the performance change at all? In this case it's not a proportion, so we can't use the previous Z test. We need a new tool. We need the t-test.
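Before we get to the t-test, here's roughly what that A/B comparison from a minute ago might look like in R. All of the counts are made up; the point is just the shape of the analysis.

    # Hypothetical A/B numbers: old site vs. new site
    conversions <- c(200, 150)     # customers
    visitors    <- c(4000, 4000)   # visitors randomly assigned to each site
    prop.test(conversions, visitors)

    # If the new site turned out to be broken for IE6, redo the comparison with the
    # IE6 visitors removed from the new site's denominator (they couldn't buy anything,
    # so they contributed zero conversions anyway). Again, made-up counts:
    prop.test(c(200, 150), c(4000, 4000 - 600))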
The basic premise of a t-test is this: we have a population we can't measure directly, and we have samples, which are estimates of that population. The assumption for a t-test, the null hypothesis, is that the two samples came from the same distribution. Basically, the assumption is that there is no change. The most important part of the t-test's output is called the p-value. The p-value is the probability that the effect we're seeing was caused just by randomness and not by our change. The t-test is actually an approximation of the permutation test. The permutation test looks at all the possible ways of splitting up the combined data, counts how many of them look as extreme or more extreme than what we actually saw, and from that works out the probability that we'd end up with something that extreme just by grabbing samples at random.

A quick example here might help. This is R, an open source statistical analysis program. It's pretty good; it's pretty hardcore, though. We can ask R for the documentation on the t-test. To keep it really simple, let's say we sample something and get the numbers one through ten, and we want to see whether the numbers seven through twenty are from the same population. We can just run a t-test; this is called a two-sample t-test: one through ten against seven through twenty. That's small on the screen; can we read it better? Okay. The important part here is this p-value, and it's an extremely small number. That means that if we assume these two samples came from the same distribution, and we just randomly drew two samples from it, we'd have only a tiny chance of ending up with something that extreme. So we say: that's not very probable. We reject the null hypothesis that they're from the same population, and we say, yes, these are from two different populations.

Earlier I showed the graph where the average is the same but the spread is different, so let me recreate that. A is going to be a sample from a random normal distribution: give me a hundred samples with a mean of 30 and a standard deviation of one. So A is now randomly drawn from that normal distribution. And let's make B another hundred samples with the exact same average but a quite a bit larger standard deviation, so it varies a lot more. This is just like that graph I showed. If we t-test these, we find the p-value is about 0.2, a one-in-five chance, because these distributions actually overlap quite a bit. So then we'd say there isn't much evidence that they're from different populations; they're most likely from the same one, so we'd conclude there's no change at all. Even if one looks a little bit faster than the other, that could just be due to randomness. So that's what the t-test gives us: we've got two samples, and we ask, are they any different?
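If you want to follow along, the R commands the demo describes look roughly like this. Your exact numbers will differ because the rnorm draws are random, and the talk doesn't say what the larger standard deviation was, so the sd of 8 below is an assumption.

    ?t.test                              # R's documentation for the t-test

    t.test(1:10, 7:20)                   # two-sample t-test; the p-value comes out tiny

    a <- rnorm(100, mean = 30, sd = 1)   # 100 draws, mean 30, small spread
    b <- rnorm(100, mean = 30, sd = 8)   # same mean, much bigger spread (sd is assumed)
    t.test(a, b)                         # since the true means are equal, the p-value is
                                         # usually unremarkable, and it varies run to run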
So, say we compare two runs of a performance test, and on one run I use my Mac, a Core 2 Duo with three gigs of RAM, and on the other I use a Linux box with an AMD processor and four gigs of RAM. I do those two runs, I run my t-test, and it says yes, there is a difference between these two samples. But the t-test only tells you that there's a difference. It doesn't tell you what caused the difference. This is the problem of confounding variables. The only way to combat confounding variables is to change one thing at a time. You keep everything else the same and change only one thing; if you isolate it so there's only one difference, that's the only way you can have confidence that the thing you changed is actually the cause of the difference you measured.

There's been a common trend I've seen in benchmarking. To do it right, you don't want your testing code competing with the application you're measuring, so you really need two computers: one to be the client and one to be the server, so the client isn't stealing CPU away from the web application. One fellow had the idea: well, I don't really have two computers, but Amazon has the Elastic Compute Cloud, so I'll get a virtualized server with standardized specifications. So it's become kind of common to do performance testing on Amazon EC2, which is an awful, awful idea. It's awful because it's virtualized: there are multiple guests per physical box, and you have no control over how many there are. You ask for one and you're assigned, basically at random, to one of those servers. So you don't know whether the benchmark changed because of the application code, the thing you're actually testing, or because of the virtualization environment, or because you just got unlucky and landed on a really busy host. For a benchmark there are too many confounding variables; you don't know what's causing a difference, if there is one. On the other hand, if you are deploying your application to Amazon EC2, then that variability is something you have to live with. In that case you'd test on EC2, and you'd need enough samples to average over that variability, so that instead of measuring the performance of this specific application in isolation, you're measuring this application on Amazon EC2. That's how you'd have to do it.

All right, so in review, we have a few tools now. We have the Z test, which tests proportions, and we can use it for things like A/B testing. And we have the t-test, to see whether two samples are from the same population or not. Some takeaways: sometimes we change something because our gut tells us it's going to be faster. But if you really think something is better, how do you actually know it's better if you haven't measured it? And with statistics it's really easy to make mistakes, so it's really important to have someone double-check us, some peer review. There are some really subtle problems we can overlook. Even statisticians, when they're working on things, always collaborate with other statisticians, because it's so easy to make mistakes. A good way of enabling that is to automate the test harness that's actually driving your test, so that it's always run the same way.
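In that spirit, here's a rough sketch of what the automated analysis step might look like in R. The file names and the column name are hypothetical; it assumes each benchmark run was saved as a CSV of requests-per-second samples.

    # Hypothetical inputs: one CSV per run, each with a requests_per_sec column
    before <- read.csv("run_before_caching.csv")$requests_per_sec
    after  <- read.csv("run_after_caching.csv")$requests_per_sec

    summary(before); sd(before)
    summary(after);  sd(after)

    # Same test, same analysis, every time, so anyone can re-run and peer review it
    print(t.test(before, after))

    # Automate the graphing too
    png("requests_per_sec.png")
    boxplot(list(before = before, after = after), ylab = "requests per second")
    dev.off()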
And one of the nice byproducts of automating it is that it makes your test easily replicated by someone else, so they can peer review it. I'm sure we all know of a really handy, cool language we can script tests with. And automate your graphing as well.

So, some resources. First, the R project. Again, it's open source, runs on multiple platforms, and it's great for statistical analysis. These guys are hardcore, and they have a lot of documentation, though much of it assumes you already understand the statistical principles. But as soon as you get some of the lingo down you can figure it out, and Wikipedia is your friend. Another excellent resource is PeepCode: they have an episode on benchmarking with httperf, and it is actually quite excellent. He goes through a Rails application and tests its performance, comparing the purely dynamic version with page caching and then with action caching. He's not specifically using a t-test in that PeepCode; he just looks at the standard deviations, which is essentially the same idea as a t-test done intuitively; he just doesn't run the function. But it is quite excellent, especially his description of how to sample an application. And of course, Zed Shaw. You're probably familiar with his rant, "Programmers Need to Learn Statistics or I Will Kill Them All." It's a good rant for identifying some of the common misconceptions we have, like confounding variables. We should never measure "how many users can this system handle?" Because how do you measure that? What is a user? It doesn't make any sense. You need to go straight for performance, like requests per second, and not bother with users; that's a meaningless metric. Also a fantastic resource is his write-up from when he was comparing Ruby/Odeum and Lucene for text searching. It's a fantastic write-up in which he steps through the entire process: how he bootstrapped the analysis, how he sampled the test to find out how many samples he actually needed to run. He includes all of the data, the testing harnesses, and all of his analysis scripts, with explanations of why he did everything. Also key to statistics is dealing with the data itself and finding meaning in a graphical representation, and Edward Tufte is fantastic for that. I highly recommend The Visual Display of Quantitative Information for how to convey quantitative information through graphs and such.

Alrighty, any questions? Hope not.

"What do you need in terms of samples? Does it have to be a representative sample? If you sample two seconds, is that enough?" Okay, so the question was how many samples you need and what it takes to be representative. What I mean by a representative sample is that you can't take a sample of convenience. If I want to find out the average height of people in the world, a non-representative sample would be whatever is convenient, like the people in my immediate vicinity. It's really hard to manufacture representative samples, especially if you're not sure what you're measuring, so the best you can do is to randomize.
And if you randomly select out of that population, and you do several of those samples, then on average those samples will be representative.

"If you have a run of, say, several minutes and you pick two seconds, is that enough? Is one enough? Two? Three?" It depends on what you're measuring. Usually, if you're doing a performance run, you take a sample, and out of that sample you calculate an average, and then you look at the sample averages. There's the central limit theorem here. The things we measure usually aren't normally distributed themselves, but the normal distribution is really easy to work with. Some vocabulary: a property of the population, like the average weight of an elephant, is called a parameter; the same quantity computed from a sample, like the average of the sample, is called a statistic. It turns out that if the statistic you're looking at is a mean, then once you have enough samples those sample means end up normally distributed, regardless of what the original distribution was. And the magic number is about 30: if you have about 30 samples and you're looking at the mean, you're good to go.

"So if you take two samples and the average of each sample is the same, have you got a good representative sample?" No. Think of the population of Utah and ask how many are men and how many are women: if your sample is this room, you're not going to get a very good answer. "Sorry, could you repeat the question?" Sure. The question was, if you take two random samples, and you're not sure what size would constitute a good random sample, and the averages of those two samples come out close to each other, does that indicate that you've got a good representative sample? No, not necessarily, because that could happen completely by chance. There's a whole branch of statistics for this, power analysis, the power test. With a power test you say: I want to be able to make a decision at the end of this with, say, 95% confidence that I'm right. Then you plug in rough estimates of the population parameters, and it tells you how many samples you would actually need.

"Can you address the difference between a random sample and a uniform distribution? There's some relationship between the randomness of the data and the distribution of that random data. If you're randomly selecting samples from the population, you eventually converge toward a uniform, representative sample set. Is there anything you can do to make sure that's a reasonable sample, that you have good clean data to start out with?" I was a little bit confused, because a uniform distribution is actually something very specific; it doesn't follow a bell curve, it's just a flat, straight line. But I think your question is: I have a population, I grab randomly out of it, and I could randomly pick a really bad section of it, right? The way to mitigate that is that random selection for a sample is not a one-time deal.
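To make the power-analysis point concrete, here's a rough sketch in R. The effect size and spread are made-up numbers; in practice you'd estimate them from a pilot run.

    # How many samples per group do I need to detect a 5 requests/sec difference,
    # if the run-to-run standard deviation is about 10, with 95% power at the 0.05 level?
    power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.95)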
So again, if we select randomly, then on average, as we keep repeating the sampling, it evens out. One time we could get a really good sample, another time a really bad one, but on average we're going to get a fairly representative one. It's not just a one-time shot; that's why you have to take multiple samples from the population. Does that answer your question?

"Well, it begins to. It seems like in software there are all sorts of crazy artifacts, maybe periodic behavior, and the speed at which you sample might happen to completely miss those periodic effects. So are there techniques for discovering that you have these crazy outliers, like the IE6 thing, and that you're just not getting any data about them?" We actually got into a bit of an argument about that one this morning. Honestly, not really. The only thing you can do about those artifacts is that if you randomly sample, then eventually, with enough samples, you're going to see some of that funky behavior. It's possible you'll never see it even with random sampling, but that's really unlikely; it has a really low probability. "So it's like the size of your error bar: we know there are this many data points in the population and we only sampled a fraction of them, so our error bar is a certain size." That's right, exactly. Sampling is the tool for discovering those outliers, but statistics isn't going to tell you whether those outliers were just chance or an actual problem with the system. That's where you have to understand the underlying causes in the system. Statistics by itself is not a very insightful tool; it's a powerful tool in the right context, but it's not an all-knowing thing. You really use it as a tool alongside your specific domain knowledge.

"With Bayesian statistics, can you find things like causality? Has anyone ever tried to aim that at httperf, or anything like that?" Okay, so the question was whether Bayesian methods are a way of finding causation, and whether anyone has used them with httperf. Small nitpick: Bayesian statistics cannot prove causation either. What Bayesian statistics does is use a probability model in reverse; it's inverse probability. It's basically like this: I deal you five cards from a deck, and depending on which five cards you got, I can go back and estimate the probability that the deck was shuffled or not. It's an inverse, a reverse probability. Bayesian methods are very cool. I don't know whether they've been applied to httperf or not; that's actually something to ask Zed about. When I ran into Zed, I was taking my stats course, and I asked him a question and he was explaining some nuances of the t-test, and then he said, have you looked at Bayesian statistics? I had used Bayesian statistics for machine learning classifiers, but I didn't know you could use it as a general-purpose statistical tool. I actually only found out last week how you'd use it that way.
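Just to make the inverse-probability idea concrete, here's a tiny sketch of the deck-of-cards example in R. The prior and the "factory order" scenario are illustrative assumptions, not something from the talk; the point is only how Bayes' rule runs the probability backwards from the evidence.

    # Prior: say we think there's a 50% chance the deck was never shuffled (an assumption)
    prior_unshuffled <- 0.5

    # Likelihood of being dealt the top five cards still in factory order:
    # certain if the deck was never shuffled, astronomically small if it was
    lik_unshuffled <- 1
    lik_shuffled   <- 1 / (52 * 51 * 50 * 49 * 48)

    # Bayes' rule: posterior probability the deck was unshuffled, given that hand
    prior_unshuffled * lik_unshuffled /
      (prior_unshuffled * lik_unshuffled + (1 - prior_unshuffled) * lik_shuffled)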
"Sorry, this sounds interesting, really cool, but I'm wondering: with httperf specifically, why would you use a sampling method when you can actually measure the entire population? I'm not a statistician." So the question was, with httperf, why even sample if you can look at the full population? I know that sometimes you can use the full population, but you have to realize that when you're doing a performance benchmark you're looking at a specific time interval, so that isn't the actual population anyway. It's just a segment of the population for a given time interval, which is a sample, right? Because you can't actually measure the performance over all time. The other thing I know is that sampling cuts down on the computation, not that that's really so awful. I'm sorry, I don't have a good answer for you on that one. You have an answer? I'd like to hear it.

"Basically, a lot of the time in statistics you're making decisions about things that don't exist yet. Your website isn't in production, and you just want to see what's going to happen, what it will have to handle, whether it will run out of capacity or whatever. So you set up a bunch of scenarios, and every scenario you run is a stand-in; it's not real production data, and you can't project it out to all the different addresses you'll actually see. You're basically using statistics to figure out what's probably the best way to go. Your test data isn't real production data; it's an artificial world that you set up with deliberately different scenarios."

And I think one other thing to say is that if you look at the full population, you get one distribution, whereas if you take samples, each one of those samples is its own distribution, and that gives you much more powerful tools to analyze it. That was a great answer; I wish I'd come up with it. Okay, thank you very much.