Now let me show you something else. What we were doing there goes by the name simple regression. What I'm going to show you here is called multiple regression. It works the same way, except that your relationship can involve more than two variables. We can have several things here. So in this example, you're a trucking company and your goal is to schedule the trucks, and as part of scheduling the trucks, it's necessary for you to be able to estimate how long the trucks will be gone. On the one hand, I don't want a bunch of idle trucks sitting around. On the other hand, I don't want someone to come in and say, I'd like to move a shipment from A to B, and I have no trucks to give him. So it becomes important to me to be able to estimate how long trucks will be gone. And I walk into the room assuming that there are really two things that influence the travel time. Maybe there are other things as well, but they fall out into the noise. The two important things, the things that I can control or have some vision into, are these: miles traveled and number of deliveries. So I figure the more miles a truck travels, on average, the more time it's going to take to leave and come back. The more deliveries it makes, the more time it's going to take, because if it goes out and makes lots of deliveries, it's going to take more time before it comes back to the shipping yard. So what we see here is data on travel time for a handful of trips that our trucks have made, miles that the trucks have traveled, and deliveries that they've made. And you can see in this data set, we've got a total of 87 hours of travel time. These trucks traveled a combined 4,000 miles and made a combined 29 deliveries. So the question is this: how can I use this information to predict travel time? Let's try a couple of straightforward things. The way I'm going to use this analysis is to answer the question: I have a truck that's going to travel 325 miles and make two deliveries.
And I'd like to know, how long is this truck going to be gone? I'm going to use this data to try and come up with an estimate for how long the truck will be gone. And let's start off by doing something that seems pretty straightforward. I have here in my data set a total of 87 hours of travel time. And these trucks that traveled 87 hours traveled a combined 4,000 miles. So if I simply do the division, my trucks on average are taking 0.022 hours for every mile they travel. So if I've got a truck that's going to travel 325 miles, I can do 325 x 0.022 and I get about 7.2 hours as my estimate for how long the truck will be gone. That's not bad. It's a straightforward thing to do. I've calculated average hours per mile and done the multiplication.

Here's a problem. I don't have data just on the number of miles traveled. I also have data on number of deliveries. And I could use number of deliveries to estimate how long my truck will be gone. So in this data set, my trucks were gone a total of 87 hours and combined they made 29 stops. That's three hours per delivery. So if I have a truck that's going to be gone making two deliveries, three hours per delivery times two deliveries, that's six hours.

Notice my problem now. I've got two conflicting estimates for how long my truck will be gone. On the one hand, if I look at hours per mile traveled, I'm estimating that my truck's going to be gone 7.2 hours. On the other hand, if I look at hours per delivery, I estimate my truck's going to be gone six hours. So which is it? Is it 7.2 hours or is it six hours? You might be tempted to just take the average of the two and say, look, if I estimate it according to miles, it's 7.2 hours. If I estimate it according to deliveries, it's six hours. Just average the two together. All right. It's not very satisfying because it's kind of just ad hoc. Why would you necessarily average these together? Why not add them?
Why not do powers or something? A much better approach here is to use what's called a multiple regression analysis. In the multiple regression analysis, we walk into the room and we say, look, I believe there's a relationship between hours of travel time, miles traveled, and deliveries made. And furthermore, I believe the relationship looks like this. Hours is some number A, plus some other number B times miles, plus some other number C times deliveries, plus noise, U. Now, the computer can tell me what A, B, and C are. U, the noise, is the things that affect the hours, the travel time, other than miles and deliveries. So things like, my driver got pulled over and inspected, or he spilled a cup of coffee on himself and had to stop. He had a fight with his wife and he's distracted and took a wrong turn. All these little pieces of noise that will influence hours, all that stuff gets lumped into U. And I want to blow all of that away so that I'm looking at this pristine relationship with all the noise gone. Just show me the relationship between miles, deliveries, and hours.

So if we run a regression on this, we feed the whole thing through the computer. The computer comes back and says, OK, you've got this cloud of data that's miles, deliveries, and hours. And the line, or in this case the plane, because it's three-dimensional, that fits the data most closely is this. Your estimated hours that your truck is going to eat up is 1.13 plus 0.01 times miles plus 0.92 times deliveries. So let's look at these numbers separately. The 0.01 times miles, what is that? The 0.01, remember, measures the magnitude of the relationship between miles and hours. So this tells me, on average, traveling an additional mile will add 0.01 hours to your trip. 0.92 is the parameter attached to deliveries. And that measures the magnitude of the relationship between deliveries and hours.
So this says, on average, making an additional delivery will add 0.92 hours to your travel time.

Now here's where life gets fascinating. When you put more than one factor into a regression analysis, like we have miles and deliveries here, two factors trying to explain hours, what the regression analysis gives you back is what's called the marginal effect. So the 0.01, technically speaking, we would call the marginal effect of miles on hours. The 0.92, we would call the marginal effect of deliveries on hours. What does that mean? It means that 0.01 is the effect of an additional mile on hours after filtering out the effect of deliveries on hours. Similarly, 0.92 is the effect of an additional delivery on hours after filtering out the effect of miles on hours.

And if you start to think about it like this, you'll notice where we would have gone wrong by taking our simple averages and putting them together. When we looked at just miles separately and deliveries separately, with miles we had an estimate of 7.2 hours to go 325 miles. And with deliveries we had an estimate of 6 hours to make two deliveries. And so our knee-jerk reaction was, well, just average those two numbers together and you get a nice estimate for how many hours it's going to take you. Here's the problem. Deliveries and miles are going to be related. On average, the further he travels, the more opportunities he has to deliver stuff, so I would expect him to be doing more deliveries. And for shorter trips, I would expect fewer deliveries to be happening because he's not going that far.
So because these two things are related, if I calculate my estimate of 7.2 hours based on miles and my estimate of 6 hours based on deliveries and then somehow combine them, I end up double counting. I end up double counting the effect of miles and deliveries because miles and deliveries are themselves interrelated. And this is the beauty of the multiple regression. When you see these effects, the 0.01 marginal effect of miles on hours and the 0.92 marginal effect of deliveries on hours, each one is the effect of that factor on hours after filtering out the effect of the other. So 0.01 is the effect of miles on hours after filtering out the effect of deliveries on hours. We use this term marginal effect to describe this phenomenon.

One other thing that we see here, and this is going to your earlier question: what does the A mean? The A here does have an interesting interpretation. So A, 1.13, is our estimate for hours when miles are zero and deliveries are zero. So imagine a truck that goes nowhere and delivers nothing. According to my model, it's going to take 1.13 hours to do that. And you might wonder why it should take any time at all. What is this thing measuring? What it's measuring might be the fixed cost of my driver going to the dispatch office, getting the keys, getting the map of where it is he's going, whatever it is, going to the truck, checking his load, the air pressure in the tires and the oil and the fuel and all of that stuff, backing the thing out of wherever it is and turning onto the road. All of that involved traveling nowhere and delivering nothing, but it contributed to the hours involved, the time. So we think of it as a fixed cost. No matter how many miles you drive, no matter how many deliveries you make, you're going to have this overhead of 1.13 hours. So that's an interesting interpretation for A in this example.

So now we'll come to my point. My point was I've got a truck that's going to travel 325 miles and make two deliveries.
I can plug it into my estimated regression model, a 325 for miles, a two for deliveries, and the thing comes back and tells me 6.2 hours. Does this mean that my truck will take exactly 6.2 hours? No, it probably won't take exactly 6.2 hours. What it says is, on average, a truck that travels 325 miles and makes two deliveries, I can expect to take 6.2 hours. And when it doesn't, it might be more, it might be less, and that's random noise. Random things happen that have nothing to do with miles, nothing to do with deliveries, to influence the actual hours.

So we were talking about magnitude effects, and we can turn now to the p-value. My walking-in hypothesis is that after accounting for deliveries, miles have no effect on hours. So my walking-in hypothesis is there's no effect here. So I look at the p-value that goes with my miles parameter and I see a very low p-value. A very low p-value means the data contradicts me. The data says no, you are wrong. There does appear to be a relationship between miles and hours, even after filtering out the effect of deliveries. Similarly, I can say, look, my walking-in hypothesis is that after I account for miles, deliveries have no effect on hours. And again, I can look at the p-value, and I see a very low p-value, meaning that no, the data contradicts me. There does appear to be a relationship between deliveries and hours.

And then finally, for this data set, I'm getting an R-squared of 0.9, which says what? Of all the variation in travel time, miles and deliveries account for about 90%. The other 10% is due to the noise: the spilling of the coffee, the argument with the wife, the getting pulled over by the police. The point of regression is to filter out the noise and to find the underlying real effects.
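As a rough check on the arithmetic in the trucking example, here is a minimal Python sketch. It uses only the totals and the rounded coefficients quoted in the lecture, so the results are approximate. The variable names and the predict_hours function are made up for illustration and are not part of the original analysis.

```python
# Totals quoted in the lecture for the handful of trips in the data set.
total_hours = 87
total_miles = 4000
total_deliveries = 29

# The trip we want to estimate.
trip_miles = 325
trip_deliveries = 2

# Naive estimate 1: average hours per mile, times miles.
hours_per_mile = total_hours / total_miles            # 0.02175, quoted as 0.022
estimate_from_miles = hours_per_mile * trip_miles     # about 7.1 (7.15 if 0.022 is used)

# Naive estimate 2: average hours per delivery, times deliveries.
hours_per_delivery = total_hours / total_deliveries   # 3.0
estimate_from_deliveries = hours_per_delivery * trip_deliveries  # 6.0


def predict_hours(miles, deliveries, a=1.13, b=0.01, c=0.92):
    """Estimated hours = A + B * miles + C * deliveries.

    The defaults are the rounded coefficients quoted in the lecture;
    the noise term U is left out because it averages to zero.
    """
    return a + b * miles + c * deliveries


# Regression estimate: 1.13 + 0.01 * 325 + 0.92 * 2 = 1.13 + 3.25 + 1.84
regression_estimate = predict_hours(trip_miles, trip_deliveries)

print(f"From miles alone:      {estimate_from_miles:.2f} hours")
print(f"From deliveries alone: {estimate_from_deliveries:.2f} hours")
print(f"From the regression:   {regression_estimate:.2f} hours")
```

The regression line works out to about 6.2 hours, matching the figure quoted above. Calling predict_hours(0, 0) returns the 1.13 hours of fixed overhead.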
And the reason this becomes very important in social sciences, particularly in economics, is because it's so difficult for us to conduct experiments. When you can conduct an experiment, you control for things. So I want to know, if I feed a plant coffee versus water, will the plant grow better? So what I do is I control for everything I can control for. I have two plants, I put them in the same temperature, the same humidity, the same light level, all of this, and I feed them the same quantity of liquid. The only thing I change is what the liquid is. This one gets coffee, this one gets water. And I measure how much they grow over time. In a controlled experiment, the point is to control everything except for the one thing you're testing for. That thing we're going to vary. And so when I see a difference in the plants, I conclude it must be due to the coffee versus the water because everything else was the same.

In economics, you rarely can do that. I can't control for things, I have to just take the data that's shown to me. So when I take the data that's shown to me, if I put it into a multiple regression model, I get the same sort of thing. So I say, for example, I wonder if increasing the income tax rate would cause people to work less. And so I look around, I have data on how many hours people work, and I have data on their marginal income tax rates. And I can run a simple regression and I can show some results, and someone's going to put their hand up and say, but wait a minute, there are lots of things that affect people's willingness to work other than their income tax rate. How do I know that this difference I'm seeing is due to the income tax rate, not something else? If the person were a plant, I'd put one of them in this box and one of them in this box and everything would be the same except for the tax rates. And I'd watch how much they work. I can't do that. So what I do is I run a multiple regression model.
And on this side, I have the thing I'm trying to explain. How many hours do you work? Over here, I have the thing that I'm asking, does this affect that? And that's the income tax rate. And then I have a whole bunch of other stuff. And all these other things are the things I'd like to control for, but can't. So things like the income of the household he lives in, his education level, his age, how much money he has in his savings account, all these other things that might also affect his willingness to work. And when I run the regression model, what I get over here is the marginal effect of the tax rate on his willingness to work. That is, it's the effect of the tax rate on his willingness to work after you filter out the impacts of all these other things. So multiple regression is an attempt to get what I would get if I had run a controlled experiment, when I can't run a controlled experiment.

Do you take any issue with that sometimes? Like I know that one example we always learned in school was that's how they got the value of a park. But what are the kind of things that are unseen if that park wasn't there? And is that an appropriate way to value a park?

Right. So first off, in many cases when it comes to economic data, there are problems with running regression analysis, but it beats the alternative. And the alternative is throwing up your hands and walking away and doing nothing, right? What's important is that we be aware of what the problems are. So one problem, for example, is I'm assuming that when I make this list of things that affect your willingness to work, I've identified all of the important ones. I may have left something out. And if I've left something out, then the results I get are not meaningful anymore. The technical term is they're biased, right? So I'm going to get numbers over here that might look and smell and taste good, but actually they're meaningless, right? They're wrong.
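The left-out-variable problem can be illustrated with a small simulation. This is a sketch on synthetic data, not the lecture's data set. It borrows the rounded trucking coefficients as the "true" relationship and assumes, purely for illustration, that deliveries rise by about one per 100 miles. The helper names are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic trips. ASSUMPTION: deliveries rise with miles, about one per
# 100 miles, plus scatter. They are not whole numbers; this is illustrative.
n = 5000
miles = [rng.uniform(50, 500) for _ in range(n)]
deliveries = [m / 100 + rng.gauss(0, 1) for m in miles]

# "True" relationship, borrowing the lecture's rounded coefficients, plus noise U.
hours = [1.13 + 0.01 * m + 0.92 * d + rng.gauss(0, 0.5)
         for m, d in zip(miles, deliveries)]


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Simple regression of hours on miles alone: deliveries are left out.
slope_miles_only = cross(miles, hours) / cross(miles, miles)

# Multiple regression on miles and deliveries, solving the 2x2 normal
# equations for the two slopes directly.
s11 = cross(miles, miles)
s22 = cross(deliveries, deliveries)
s12 = cross(miles, deliveries)
s1y = cross(miles, hours)
s2y = cross(deliveries, hours)
det = s11 * s22 - s12 ** 2
b_miles = (s22 * s1y - s12 * s2y) / det
b_deliveries = (s11 * s2y - s12 * s1y) / det

print(f"Miles slope, deliveries left out: {slope_miles_only:.4f}")
print(f"Miles slope, deliveries included: {b_miles:.4f}")
print(f"Deliveries slope:                 {b_deliveries:.4f}")
```

Under these assumptions the miles-only slope should land near 0.019 rather than the true 0.01. It absorbs the effect of the deliveries that tend to come along with extra miles. With deliveries included, both slopes should come back close to the values used to build the data.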
Another problem is I've assumed that the relationship is linear, right? When we do regression, it's all linear relationships. A one-unit change in this causes some fixed change in this thing over here. It's possible the relationship isn't linear. Maybe it's the case that at low tax rates, raising the tax rate a bit has a big impact on your willingness to work, but at high tax rates, maybe raising your tax rate doesn't have much of an effect. It's a nonlinear relationship. If it's a nonlinear relationship, and I put together a linear regression, then I'm again going to get results that aren't meaningful. But I have to be aware that that's a possibility, and there are all kinds of tests in the background that I can run to verify whether indeed I've set this thing up correctly, right? So it's not as simple as throwing the things into the pot and seeing what comes out. In fact, I tell my students who have reached this level that they are now officially dangerous. They know enough to be able to put data into the machine and to run the thing and to get results and to interpret the results. But they don't know enough to be aware of where they might have gone wrong such that their results, good as they look, are actually meaningless.

So how much can we trust stats that we encounter kind of in the media or in a newspaper? And what are some red flags to look for to kind of see bad stats?

Yeah, that's a good question. So, you know, this goes to the thing that people like to say, that there are lies, damn lies, and statistics, to which I like to respond, there are liars, damn liars, and people who don't understand statistics but repeat them anyway. I think the problem with statistics, now, there are stats out there that, you know, are just wrong. Usually you find them in memes.
But if you're looking at actual stats, stats from reputable places, you know, government statistics, UN statistics, Gallup, these kinds of things, the problem isn't the statistics. Statistics don't lie. But the humans who are presenting them can. For example, there was a survey of something like 100 or 150 economists not too long ago asking about the minimum wage. And a large number of the economists reported that they didn't think that the minimum wage caused unemployment, or at least this is the way it was presented. So what you saw from people who were pro minimum wage who presented this research was, look at all these economists, you know, one third of them or whatever the number was, it was a large number, conclude that increasing the minimum wage does not cause unemployment. And you look at that and you say, well, wow, the economists think minimum wage doesn't cause unemployment. The problem wasn't the statistic. The problem was the person who was repeating it. If you dig into the question that the economists were asked, they were not asked, does increasing the minimum wage cause unemployment? They were asked, does increasing the minimum wage cause significant unemployment? And that's where you got, you know, one third of them saying no. They weren't saying no to minimum wage causing unemployment, they were saying no to minimum wage causing significant unemployment. Did some of them maybe think that it had no effect at all? Possibly. But the question wasn't asking what the person who was repeating the statistic said that the question was asking. And as for that one little word, significant, I'm not saying that the person who was repeating the statistic was deliberately lying.
The person may not have understood the significance of that one little word, but that one little word makes a big difference to how the people who read the question answer, and can make a big difference to what the statistic actually means. So in summary, I would say the thing to be careful of isn't so much the numbers as it is the person who is telling you what the numbers mean.

So what you've described is a very empirical approach to economics and social science. How does this contrast with the more a priori approach of deducing from some basic assumptions and principles about purposeful human action that's employed by Austrian economists?

Yeah, it's an interesting question. Actually, I've written what I thought was a pretty good paper for Cato on exactly that topic, and lots of people write me even still about it. And the question is how do I, as an economist who loves the Austrian school but also loves statistical analysis, reconcile those two viewpoints. And I don't see them in opposition. Remember, as we talked about statistics and p-values and regression, at each step, in each example, what I said is you walk into the room with an assumption as to how the world works and you ask the data, do you, data, confirm or deny this thing that I'm assuming? And left unsaid here is how you come up with the thing that you're assuming, that you're walking in with. And the Austrian approach gives us lots of interesting things here, principally because they start from first principles, right? This is how we believe humans behave. And if humans behave this way, then the following things result from that. That's the hypothesis that we walk in the door with. One of the things we haven't discussed, and it's true and it's a problem, is data mining.
And this is becoming more of a problem as we get more and more data and computing power. You know, on the phone in my pocket, I can do stuff that NASA couldn't do with its roomfuls of computers 30 or 40 years ago. One of the things that means is it becomes very easy to search lots and lots of data very quickly. And if you have the ability to search lots and lots of data very quickly, you will find stuff that looks real. It looks like, you know, sales of keyboards are influenced by the colors of pens that people have. And you look at the data and lo and behold, every time we sell more red pens, keyboard sales go down. We sell more blue pens, keyboard sales go up. We found this wonderful thing. What you found is a spurious relationship that, by random chance, happens to be the case. Now the real problem with data mining is you multiply by orders of magnitude the likelihood of finding spurious relationships, because you're looking at all kinds of things, right? You're guaranteed, if you look hard enough, to find these spurious relationships. What's needed to help guard against finding spurious relationships is connecting the analysis you're performing to some defensible hypothesis. That is, there is no defensible hypothesis that says the colors of pens sold influence keyboard sales. So I shouldn't even be looking at that relationship. What the Austrians give us is the reminder that we must be rooted in first principles. And that guides the sort of things that we analyze. And to that extent, I think the two, data analysis and Austrian economics, actually are complementary.

Humans are storytellers. We like stories from the time, you know, we're little and people tell us things. And when we're older, it's anecdotes. Anecdotes catch our imaginations. They stick in our memories.
And I think humans have evolved to be anecdotal creatures because, you know, prior to the invention of writing and widespread literacy, which is a very recent thing, that's the way we passed down wisdom from generation to generation. You know, what plant not to eat, what animal to stay away from because it's going to eat you. These sorts of things, you tell stories about Uncle Joe, who got too close to that big cat and it just bit his leg off, right? No, God, I'm not going to do that, right? And we embellish it with all kinds of other things to make it more scary so the kids don't go anywhere near the thing, right? So we're anecdotal creatures. And on top of that, anecdotes give us an entertaining and colorful way to grapple with the world around us. The problem is, good decisions are more often made by statistics, or analysis of statistics, than by consulting anecdotes. And the problem with statistics is they're not colorful, they're not interesting, right? We have to force kids to sit through stats classes because nobody likes this stuff. So I think one of the takeaway messages here is, while there's no way to make statistics more palatable, it is at least necessary to communicate to people who maybe aren't interested in knowing about statistics that there are great drawbacks to using anecdotes as your map for the world as you go about making decisions. And at least help people to understand that, you know, when you make a decision to ban a drug, or to ban a particular type of weapon, or to subsidize this thing or tax that thing, it's important that we hold back a little bit our desire to help others, to do the right thing, to behave well, and remember that the data are the things that we should consult in making decisions. Your heart is a wonderful thing to consult in asking the question, what is it we should be making decisions about? But the decision itself needs to be driven by data.