A scatter plot is a graph which is just a collection of individual points on the plane. So you might have your x and y axes, and you have all these different points just scattered around on the screen. Now, it could be that we could come up with some function that can model the data, so it's like, whoa, I found a function that models all this data. Maybe. But it could also be, kind of like you see in the picture here, that there is no trend, no correlation in the data whatsoever. We often graph data sets so that we can see with our own eyes: is there any type of relationship between these data points?

So let's look at a couple of examples. The scatter plot you see on the screen right now compares the number of passing touchdowns of NFL quarterbacks from the 2011-2012 season versus their base salary. So how many touchdowns were they throwing versus how much they got paid? And you can see, when you look at this graph right here, that there doesn't seem to be any correlation between how much the quarterbacks are getting paid and how many touchdowns they were making. You look at a data point like this one right here, and this one right here, where these two quarterbacks were making lots of touchdown passes, but their salary was pretty small. So in some respect these quarterbacks are definitely being underpaid. On the other hand, you look at, say, this data point right here, where this quarterback wasn't making a lot of touchdown passes but was making the most money of this whole group, right?
So it doesn't seem like there was any connection between the pay and the touchdown passes. Why they were getting paid so much probably had more to do with celebrity or some other metric besides touchdowns, which of course you would think would be a good thing for a quarterback in the game of football.

On the other hand, we're gonna look at this scatter plot, where again the horizontal axis is gonna be the number of touchdown passes, but this time we'll look at the quarterback ratings. In which case you see there actually is a pretty strong correlation between the two. That is, the higher your rating, the more passes you made, and the fewer passes you made, the lower the rating was. And so this kind of leads to the idea that when you look at this data, it almost looks like the data fits some type of line, that we could draw a diagonal line that for the most part captures the relationship between the touchdown passes and the ratings of the quarterbacks. Is it gonna be perfect? No, but we could use this line to try to make some predictions about the data. Now, this line that we're talking about is referred to as a regression line, or sometimes it's called the line of best fit.

As another example, let's consider a local school district that wants to evaluate the relationship between class size and performance on standardized tests. So it collects data: here are the different class sizes that they have in this school district, 15 being the smallest and 30 being the biggest, and then they looked at the average on the standardized test. And we can see this downward trajectory: as the class size got bigger, the performance on the test seemed to go down. There was this relationship that almost looks like a linear relationship.
And so when you graph this red regression line, this line of best fit, you can use it to predict what's going on: every time I go down this many units, I go over this many units. And we can actually then calculate and interpolate what data we can expect. That is, what would happen if we had, like, a class of size 22? What would we expect the average test score to be?

So I want to mention to you that most calculators and computer programs can do this. You can do this on your standard Texas Instruments graphing calculator, and Microsoft Excel can do these things as well. There are many, many programs out there equipped with the ability to insert scatter plots, and then with that scatter plot you can construct a line of best fit, the so-called least squares regression line. Now, that regression line takes us beyond the scope of the course we are in right now. I mean, it uses some techniques from linear algebra and statistics, which I don't want to introduce right now. But given that we are able to draw lines (after all, a line is just determined by two points), what one could do is try to just wing it. That is, could we eyeball the line of best fit? Can we just draw ourselves a line that seems like it fits the data, without any sophisticated algorithm whatsoever, and still answer questions with it?

And so to illustrate this, I want to use the following example. Imagine we have an outdoor snack bar that collected data showing the number of cups of hot cocoa it sold. We're going to call that C.
That's the number of cups they sold when the temperature for the day was t degrees Celsius. So they want to find a relationship between the number of cups of hot chocolate they sold and how hot it is. It makes sense that the hotter it is outside, the less likely people are to buy hot chocolate. Now of course, if you're like me, then I like to drink hot chocolate all the time, much in the same respect that I like to eat ice cream all the time. When it's super cold outside, you can catch me eating ice cream. When it's super hot outside, you'll be catching me drinking hot chocolate, and eating ice cream of course. This isn't a seasonal thing. This is just for me, all the time. So I'm sort of a weird outlier in that regard.

But if we look at the data set that this snack bar collected, they have the following. So we see things like when it was two degrees Celsius outside. That's really close to freezing. I mean, for those who aren't familiar with Celsius, zero degrees Celsius is the freezing point of water, and that's 32 degrees Fahrenheit, right? So two degrees Celsius, it's really cold. They were selling lots of hot chocolate: 45 cups they sold in a specific afternoon. And it seems to be going down progressively as it warms up. When it's four degrees outside, they sold 42 cups. When it's 10 degrees outside, they sold 25 cups. When it was 18 degrees outside, they sold six cups. Things like that, right? And so you see the scatter plot on the screen right now.

Now, if we wanted to, we could construct a line of best fit, and it would look something like the following, right? I mean, it's gonna be somewhat subjective if we don't try to use the algorithm, right?
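For comparison with a hand-drawn line, here is a minimal sketch of what a least squares fit computes. It uses only the four readings quoted out loud, (2, 45), (4, 42), (10, 25) and (18, 6), since the full data set from the screen is not reproduced here; the function name `least_squares_line` is just an illustrative choice, and the result should be read as illustrative too.

```python
# Least squares line through the four (temperature, cups) readings quoted
# in the lecture. The full data set is not available here, so this is an
# illustration of the method, not the lecture's actual regression line.
temps = [2, 4, 10, 18]      # degrees Celsius
cups = [45, 42, 25, 6]      # cups of hot chocolate sold

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(temps, cups)
print(f"C is about {slope:.2f} * t + {intercept:.2f}")
```

On these four points the slope comes out near -2.5 cups per degree, with an intercept near 50.7 cups.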
But we might get something like that. We're trying to find something that seems to match the data pretty well, and so I'm gonna draw a line that looks like the following. Again, very, very subjective here. The least squares regression algorithm I mentioned earlier removes that subjectivity from the situation. But we could still answer questions, you know, for our snack bar manager here, who might not know much about linear algebra. They're still very well equipped to answer questions and make predictions using this sort of guess at the line of best fit.

So, okay, let's first start off with an equation. Can we come up with a formula for this line? Now, a line is determined by two points, so if we can find two points on the line, we can find an equation. Don't worry about the scatter plot itself, the points that gave us the line. Look at the line itself and try to read off some points. So for example, if we look at when the temperature outside was eight degrees Celsius, the line looks like it's a little bit above 30, so I'm gonna say that's like 32 cups. Again, that's just kind of an estimate there. And then we need another point.
Maybe you come over here, when it was 16 degrees outside. Again, the line is a little bit above 10, and so I'm gonna say that was 12: 12 cups of cocoa were sold on that given day. So I'm gonna take these two specific points, points that are on the line I just drew, (8, 32) and (16, 12), and I'm gonna use them to build a line.

So the first thing I would do is look at my slope formula. My slope is gonna be rise over run. And notice how I oriented my graph: I put the temperature on the horizontal axis and the cups of cocoa on the vertical axis. So I'm thinking: when the temperature is such and such, what will be the number of cups sold that day? So we would take 12 minus 32 divided by 16 minus 8. We see here that 12 minus 32 is negative 20, over 16 minus 8, which itself is 8. You can simplify that fraction to be negative 5 over 2, or you could write negative 2.5 if you prefer. That would be the slope of this line.

Then, using the point-slope form of the line, I'm going to take the number of cups, take away, and pick your favorite of the two blue points we chose. I'm gonna use the first one, (8, 32). So we're gonna take c minus 32, and this is gonna equal negative 5 halves times the quantity t minus 8.
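The slope and point-slope steps above can be checked with a short sketch. The variable names (`t1`, `c1`, `m`) and the helper `point_slope` are illustrative choices, not anything from the lecture; `Fraction` is used only so the slope stays exact as negative 5 halves.

```python
from fractions import Fraction

# Two points read off the hand-drawn line: (temperature, cups).
t1, c1 = 8, 32
t2, c2 = 16, 12

# Slope is rise over run; Fraction keeps the exact value.
m = Fraction(c2 - c1, t2 - t1)
print(m)          # -5/2
print(float(m))   # -2.5

def point_slope(t):
    """Point-slope form through (t1, c1), solved for c: c = c1 + m * (t - t1)."""
    return c1 + m * (t - t1)

# Both chosen points should lie on the line.
assert point_slope(8) == 32
assert point_slope(16) == 12
```

Either of the two points could serve as the anchor in the point-slope form; the resulting line is the same.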
That equation is my point-slope form of the line. I'm now going to solve this equation for c, that is, put it in slope-intercept form. To do that, we're gonna distribute the negative 5 halves, that is, we'll distribute the slope. So we get negative 5 halves t, plus, because it's a double negative now, 5 halves times 8. Well, 2 goes into 8 four times, and 4 times 5 is 20. And then we're going to add the 32 to both sides of the equation, and now we have our formula ready to go: the number of cups of cocoa we sell should be approximately negative 5 halves times the temperature in Celsius, plus 52. So this is going to give us our model. We just built this model by looking for an equation of a line that fits the two points that seem to be on the regression line.

And so coming down here, we're going to use this equation to try to make predictions about how many cups of cocoa we are going to sell at specific temperatures. So looking at the first one, let's say that t is equal to 9 degrees Celsius. How many cups of cocoa will we expect to sell on that day? Well, the number of cups is going to equal negative 5 halves times 9, plus 52. 9 times 5 is 45, so we're going to get negative 45 over 2, plus 52. Probably want to switch this to a decimal. I mean, I've been using fractions, but a decimal might be more helpful here. 2 goes into 45 22.5 times, so that's negative 22.5 plus 52. And so then we would add the negative 22.5 to the 52, and that's going to give us 29.5. Now, you can't sell half a cup of hot chocolate there, so really we would be saying something like, okay, we're going to sell about 29 to 30 cups on this day. And again, this is just an estimate. We don't expect that this model is gonna be 100% perfect, but we expect the error to be relatively small. We expect, within a certain margin of error, to sell 29 to 30 cups of hot chocolate on this given day.
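As a quick check on the algebra, here is a small sketch of the slope-intercept model evaluated at 9 degrees. The function name `cups_sold` is a made-up label for the eyeballed line, and the loop just confirms that the slope-intercept form agrees with the point-slope form it came from.

```python
def cups_sold(t):
    """Eyeballed model from the lecture: C = -5/2 * t + 52."""
    return -5 / 2 * t + 52

# Slope-intercept form should agree with the point-slope form through (8, 32).
for t in (8, 16):
    assert cups_sold(t) == 32 + (-5 / 2) * (t - 8)

estimate = cups_sold(9)
print(estimate)  # 29.5
```

Since cups come in whole numbers, the 29.5 is best read as "about 29 to 30 cups" rather than as an exact count.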
On the other hand, what if we tried the same question again with 24 degrees Celsius? Well, in that case the cups of cocoa would equal negative 5 over 2 times 24, plus 52. In which case 2 goes into 24 12 times, and 12 times 5 is going to give us 60, so you get negative 60 plus 52. That adds up to be negative 8. Negative 8 cups of cocoa? Does that even make any sense here? Clearly you can't get less than zero in this situation. And I should mention that models have limitations, right? The thing is, once you get outside of a reasonable range for the data set, you can't expect this model to be accurate. We could probably say something like: we would anticipate not selling any cups of cocoa on this specific day. And again, for those who are not familiar with Celsius measurement, 24 degrees Celsius is 75.2 degrees Fahrenheit, and so that's a pretty warm day outside. So you wouldn't expect to sell a lot of hot chocolate on a day like that. Unless of course it's me. You know, I like hot chocolate still, even on a hot day, even on a nice spring day like 75.2 degrees Fahrenheit.

But again, models have limitations. You can't expect them to be perfect. We expect there to be some error when it comes to a model, but if it's a good model, the error will still be small. And we can come up with linear regression lines to match up with the data we see in the scatter plot. Clearly, the better the algorithm you use to find the line, the less error your predictions are going to have. But for our purposes, we could draw a line that looks like it fits, pick two points on that line (not from the data set necessarily, but two points on that line), and then build an equation of the line through those points. And we can use that to make predictions about linear relationships in data.
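The 24-degree case can be sketched the same way. Clamping the prediction at zero is one simple way to express "we don't expect to sell any"; that choice, and the names `predicted_cups` and `celsius_to_fahrenheit`, are assumptions made for illustration rather than anything prescribed in the lecture.

```python
def cups_sold(t):
    """Eyeballed model from the lecture: C = -5/2 * t + 52."""
    return -5 / 2 * t + 52

def predicted_cups(t):
    """Clamp the model at zero, since negative sales are impossible."""
    return max(0.0, cups_sold(t))

def celsius_to_fahrenheit(c):
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32

print(cups_sold(24))              # -8.0
print(predicted_cups(24))         # 0.0
print(celsius_to_fahrenheit(24))  # 75.2
```

The raw model output of negative 8 is the signal that 24 degrees is outside the range where the line can be trusted; the clamp only tidies up the answer.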