 Statistics and Excel, scatter plots with car-related data. Got data? Let's get stuck into it with statistics and Excel. Desktop version of OneNote here continuing on with our theme of taking data and creating a pictorial representation of it, allowing us to get a different view, a different angle at that data, hopefully helping us to extract more meaning from the data than we otherwise would be able to do. Now in prior presentations our primary pictorial representations have been the box plot or box and whiskers and the histograms. Great tools for taking one type of data set and trying to get a feel of that data set, looking for example the midpoint and the spread of the data. However, note that usually those tools are used for one data set, not multiple data sets, although you can think about multiple data sets by having a box and whiskers side-by-side or histogram side-by-side or possibly graphing one histogram on top of another histogram in the same chart which could give you a nice pictorial representation. But now we want to think about a scatter plot type representation. When we think about a scatter plot representation we're usually looking at two sets of data and we're trying to determine if there's going to be a relationship. Do they move together, these two sets of data? Now remember that we're always kind of thinking as human beings if they move together there's a cause and effect kind of relationship. That's what we kind of jump to but we can't really, we have to be careful of jumping to that conclusion. With the statistics of course what we're doing is trying to say okay is there a statistical relationship that these two things move together and if there is then we can go on and try to determine if there's if one is causing the other ones. Now we're always keeping in mind whenever we whenever we explore these concepts is one the statistics we can say these two things are kind of happening in correlation to each other but then the question is is there is there one causing the other or is there some other factor that's causing both of them that we have to always keep in mind and we also have to keep in mind whether or not we're getting the cause and effect wrong the chicken and the egg problem which one came first which one is caught if there is a cause and effect relationship it's possible to get it backwards and those are two things that we have to be very cautious of. So remember the statistics just doing this the box plot or the scatter plot is going to help us to get a visual representation to see if they're kind of moving in alignment and then we would have to go from there to further determine if there's a you know a cause and effect relationship. Alright so here's our information we've got car related information miles per gallon there gallons cylinders displacement horsepower weight acceleration model and these different factors in a car like if you're thinking about purchasing a car clearly they could have impacts on each other. So for example you would think that if you had more cylinders in the car that that might lead to a higher horsepower. So you might say that there's kind of a you would think there might be a relationship to these types of things if you have more cylinders in the car you would think that that horsepower might go up and as horsepower goes up maybe maybe the miles per gallon isn't as efficient you would think right if you have a high horsepower car especially if you're driving it in the city where you have a lot of stop and go traffic then you would think that then that might be miles per gallon reduce the miles per gallon the weight of the car you would think the weight of the car might have an impact on the the miles per gallon of the car and so on acceleration similar kind of thing you think you would think that the weight of the car might have some impact on the acceleration so you can we can plot these some of these data points together you can imagine multiple different ways that you can plot these data points, these two sets of data at a time, two at a time on the x and y axis to try to determine if there's some kind of movement that is happening together. And then of course, you've got to think about, well, what, what is the chicken and the egg thing? What's the, what's the cause is one cause and the other, or they moving together because of some third factor that is involved, we'll talk more about that later. But for now, if we look at this data, we're going to make something like this, which is going to be our scatter plot. Now just logistically note that in Excel, it's it's what happens oftentimes, if you just want to have the easiest way to input the scatter plot, then you can just put these two sets of data side by side. So in this case, we've got the horsepower and the miles per gallon that we're going to be looking at. And so here's our data set. So we just copied them from here over here, you don't have to do that. But that's the easiest way for Excel to just automatically create the scatter plot. Pretty much correctly. So if you were to just highlight this data and create the scatter plot, then it would put on the x axis, the horsepower. So here's the horsepower on the x axis. The the x axis is traditionally what we would think of as the independent variable. And so and then we're saying that the dependent variable, the thing that is that is moving because of horsepower, and that thought process would be the miles per gallon. That's just how we normally think about it when we plot it. You could think about it the other way and you could try to read the graph as this was the this was the independent variable. And this was the dependent but traditionally, of course, we put on the x axis what we think of as the independent variable. So it's moving independently. And then the miles per gallon up here. So note, another classic example of this is like is like SAT scores, if you're in like a college type of situation. And you're trying to look at test scores that a bunch of students took before they went into college, and you're trying to see if the test scores have an impact on their outcome, their grade point averages in college. That's a classic type of example. And in that type of example, note that the people that we have all the test scores, we have all the SAT test scores, right? So so we have all the data, and then we have the data related to the grades. And we're trying to then plot to determine the depend the independent variable in that case would be the SAT scores. And then the dependent variable, we're trying to see if there's a relationship or if they move together would be on the y axis for the actual grade point averages, right? Here, we're saying, okay, the horsepower, in our case, we have the horsepower that we know all the data for the horsepower. And we're trying to say that is that driving is the relationship between the horsepower and the miles per gallon. So if we look at the horsepower, then this is sorted from lower from highest to lowest in the horsepower. So on the highest point of the horsepower, we're at the 230. So if we go all the way out here somewhere to the horsepower somewhere out here is the 230 on the x axis. And if I then move that on over to the to the y axis, the miles, the miles per gallon, you can see the miles per gallon is lower than some of the other points, which is kind of what you would expect a higher horsepower car has has the lower miles per gallon. If I go all the way down to the bottom of the chart somewhere around like 96 on the horsepower, you know, would be around 96 around here, that's somewhere in the middle somewhere in here. You can see that if I take these spots, they're higher on the miles per gallon, which is kind of what you would expect. So you have some one of a relationship, the and then the question is, is the horsepower the thing that's called causing the miles per gallon situation, or are they both moving in alignment with some third factor, which we'll again, we'll talk more about later. But you can see that there's there's somewhat of a relationship. It's not a perfect relationship, of course, because we're plotting all the points here. And then we're trying to see a trend, whether there's a trend that is happening. Now oftentimes, the next thing that we would do is create the trend line. So this you can do in Excel by by adding a trend line. Oftentimes, we're looking for a straight line. That's what we would like to typically do. We would like to create kind of a straight line if possible if there's a relationship that looks similar to a straight line, so that we get the function of the line. Now this is similar. When we saw like the histograms, and we said if we can create a curve of the histogram that we can make a function for that would be useful, because then we can use mathematical equations to make predictions based on the equation. You have a similar thing here, right? We know that this everything doesn't fall in on the line. But if we can create a line that approximates the data in some way, that would be great. Because then then we can then we can at least make an equation and we can make predictions on the equation, which aren't perfect. But there's some kind of correlation that seems to be happening here. Now if there wasn't a straight line, and you had some other kind of curved line, then again, what we would really like to see happen is to have some kind of line that we can put through the data that approximates the data set that we can make a formula for. Because again, if we have the formula, then we can do more approximations, the straight line would be the easiest formula to be dealing with typically. So when you see something like this, clearly, the more tight the dots are around the center line, the more correlation there is the more they are moving together. And that's different from the question of which is one cause and the other. But but the more they're moving together in that sense. So so that means that this this line will possibly give more more predictive value, the tighter the dots are together around the line. If the dots are nowhere near around the line, they're scattered all over the place. And you can't you know, they're basically randomly scattered. Well, then you don't have a very good correlation, you can try to draw a line in it. But that line is not going to give you very much predictive power. If you're trying to determine, you know, horsepower and then the miles per gallon type of thing, right? So we're trying to because there's not a big correlation. Now in Excel, note that you can enter the plot, you can add the line. And then you can display the the equation on the line if you so choose here. And of course, you can also add it's quite important to add the x and y axisies. Because you and that's more important oftentimes than the title here. Because the title of this, you can you can make the title we're comparing, you know, the miles per gallon gallon and the horsepower. But the important point is you could see that on the graph. The important point is well, which is on the x axis, and which is on, you know, the y axis, so we can see what's going on. Now, if you were to analyze this data, once you once you do this, you might start to think, well, why does this kind of happen like that? You know, over here, you've got like very low horsepower. And we're always I'm always looking at the horsepower on the x first because I'm thinking of it as independent. And then I'm looking at that and comparing that to the miles per gallon on the right. And it's like, well, on much better miles per gallon with these very low horsepower, and then somewhere around here when you get to this kind of mid range of the horsepower, you've got a bigger spread in the middle and but a tighter line. And then down here, we have this very high horsepower, you actually have a little bit, a little bit higher miles per gallon than some of these items down here, where the horsepower is lower, but you've got you've got a lower miles per gallon. So these these some of these don't really follow the trend. So you might try to figure that out and say, well, why exactly might that be? I don't know, I'm just kind of but you can kind of kind of throw out theories from that point and try to speculate what is actually happening with the graph. I mean, maybe for example, these very high horsepower cars are not being driven in more populated areas or something. And therefore, they're actually being a little bit more efficient on the miles per gallon, because they're driven maybe by the way they're supposed to be driven, and more wide open areas. Whereas, if you have some of these more higher horsepower cars that are somewhat in the middle, but are being driven in downtown LA, where they're where they're going to be moving, you know, to two miles an hour and a stop to traffic, then you would think that would be highly inefficient for a car. So I don't know, I'm just, you know, then you can kind of speculate and try to see, well, what is actually going on with this relationship. Now also note that if you think of the miles per gallon as the independent, you can kind of do that, but that's not traditionally how it's graphed. We usually think of the X as the independent, but you could say, well, what if the causal relationship, what if I assume the causal relationship is the is the other way around, you know, in some case, you know, you're going to say, well, if I have the the miles per gallon up here, then then if you can imagine, is that causing, you know, the horsepower to be to be here, right, to be lesser on the horsepower. And then when I when I say the miles per gallon are quite low, I could say if I imagine this to be the independent, I could say there's the independent, and then the dependent would be that there's a higher horsepower, right. So so obviously that's probably not exactly the case because you could imagine making a very inefficient miles per gallon, you know, gasoline, guzzling machine car, which still doesn't have a lot of horsepower. So but in any case, so there's so there's that one. Now we like we can look at another one here. This one is the miles per gallon and the acceleration. So in this case, we're imagining the independent variable being the miles per gallon here and then the acceleration being the dependent variable. So once again, we took the data from our data set. And you don't have to do it this way, but I copied the two components that are most relevant and put them next to each other. And I put what I'm going to say is on the x axis, which is usually the independent variable on the left, the dependent variable on the right. And then if I just highlight that data and input a scatter plot, excel will in the most part get it right. And so now you've got this relationship, which you've got this scattering here, which you can then create a line through. And then if you're close to that line, you can look at the formula for the line and possibly have some predictive power of that line. So here we have a similar thing. We've got the miles per gallon. If I have a low miles per gallon, then I result in and I'm always looking at the x first and the acceleration a little bit lower on the acceleration, right? And if I have the if I have the miles per gallon up here somewhere, miles per gallon, you know, way up here, then then the acceleration is actually a little bit higher. And so you can kind of see a relationship here. And again, the question would be, well, there's kind of they're kind of moving together is one causing the other or maybe there's a third factor involved and we'll talk more about those situations in the future. But for now, just for constructing the scatter plot, we're usually trying to compare two sets of data. We usually want to be graphing on the x axis, the independent item on the y axis, the dependent item in excel. That is easiest to do if you have your column of data for the independent or x on the left and you're dependent on the right, you can just select the data and import the graph. Once you do, I would usually remove the title unless you want to add the title as well, but then add the accesses because it's quite important for the scatter plot to know which is the x, which is the y axis, then it's often useful to input the regression line and and then I have the line, the trend line, and then you can also then put the formula in place. I think those are just some useful, a useful kind of format to be creating the scatter plot using the tool of excel.