Hi everyone, sorry about that. The usual thing of not being sure whether I was recording, so I had to check. I hope everyone is doing well after surviving Winter Storm Uri. I'm still trying to get everything reorganized personally, as I watch how everyone else, from the department to doctors' offices, goes about rescheduling, and as I try to reconcile various people's schedules with various other people's. So I have extended the due date on the next lab by one week for the time that campus was closed, plus an extra three days because of all the rescheduling madness, including the fact that it put me an extra day behind. If anyone else is still running into those kinds of issues, where one professor reschedules something to a date that conflicts with something else, or something just wasn't rescheduled, maybe because it's virtual and in another area, and you need an additional allowance, please let me know. There's no issue with that. Our goal here is for you to be able to learn what you need to learn. The next lab will be March 8th, because comprehensive exams were rescheduled to next week, and that is always a lab-closed event for the department; since it's always a lab-closed event, there's no lab next week. I'll also be busy during the normal lab time because I have an exam from eight to eight that day. So, March 8th. Another note: I have uploaded absolutely everything for the course except the videos to the GitHub repo. That includes all of the lab and lecture slides, the lecture scripts, the lecture data, my homework 1 script for R, the homeworks, the homework data, the reading assignments, and all of the lab and lecture handouts. Everything that was not in the initial lab material on GitHub is now there. The only things you're not going to find there are the videos, because they're too large and because I don't think the professor wants the lecture videos uploaded anywhere except Microsoft Stream, and the lab surveys, which I have not put up at this point just because they're plain text and I cover all the questions in the lab slides. I may do that later, but other than that, everything is there. This should mean you have a well-organized place to keep everything for this class once it's done, when you need to refer back to it, and hopefully that will be useful when you're doing your own research later. For those of you who are feeling particularly daring, you might want to try creating your own GitHub branch at some point so that you can upload your results to your own branch without affecting the main one. That way you'll be able to upload whatever you produce, including your own homework, to your own branch. If you want to do that and you want to put it off until the end of the semester, which might not be a bad idea, I can go through how to do it with you toward the end of the semester. It'll be a learning experience for both of us, because it's not something I've done before either. But for anybody who's feeling daring and wants to learn a new skill with me toward the end of the semester, we can definitely do that. Okay, now I'm going to jump into the actual material. I'm going to go through the follow-up on statistics from lab three, then the overview of lab four statistics, then the technical lab section, which will include the survey from last week. After that I'll do a follow-up on the coding questions from last week, and then I'll talk briefly about the coding for this week.
So as far as follow-up on statistics from last week, there is nothing major to see here. I'm going to cover some minor points in the survey answers, although for the most part everybody got the answers right. There were some places where people answered with a different interpretation than what I was asking, and that was fine; everybody got something out of it and everybody did the work. I will go through what my ideas were when I asked the questions. As far as the upcoming lab, the one that you're about to start, I have some concepts and comments I want to go over with you. First of all, this is a really interesting article to me. It's a political economy piece, a public choice piece, that looks at the incentives among highway patrolmen to, basically, please voters. Back in fall 2019, when I was taking the graduate political economy class in our department, someone I know was taking the undergraduate political economy class in the econ department, Econ 4389. It's a senior-level class, and the professor who teaches it also teaches game theory; I took game theory with him, and both of those are intense classes. We shared readings back and forth, and it was comparable in difficulty to our graduate-level political economy class. We did not cover this paper, but they did, so it was just interesting for that reason. If you have any interest at all in political economy and public choice, this is a paper you should look at, and the full reference is on the lab worksheet. So as far as what you're going to go over this week: testing slope coefficients is the point of this lab. You're going to use t-tests and confidence intervals to examine significance. The t-tests are the more formal approach, and they're not necessarily easy to picture spatially in your head. Confidence intervals are a little easier to picture; you can actually plot them out. We had some code for doing that in, I think, the last lab or the one before, and in some of the lecture examples. So if you want to, you can plot those confidence intervals out in order to visualize them. If you don't want to plot them in R, you can still draw them out by hand and see how they overlap or don't overlap. You're going to look at the significance of a single coefficient to see if the slope equals zero, which is the null hypothesis. If it equals zero, the coefficient isn't significant. If you have a coefficient of 0.15, that may or may not be significant, and that's what the tests tell you: whether it's statistically different from zero. You're also going to look at two different coefficients to determine whether they're equal, which is the null hypothesis there. So you might have a coefficient that's 0.1 and another coefficient that's 0.15: are those actually different statistically, or are they statistically the same? You'll also look at some things that should be getting more familiar to you. This includes the variance-covariance matrix of the estimators. You're going to use it in this lab, it was used in the last lab, it was used in a recent lecture, and I think it's going to be used in the current lecture, so it gets a lot of use in this course. It's important to understand how to derive it, both statistically and in code. I'll grant that deriving it statistically by hand is probably a bit cumbersome, but you should at least understand what the things are that are in it.
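To make that concrete, here is a minimal sketch of both kinds of tests. The data frame and variable names (dat, y, x1, x2) are placeholders, not the actual lab data, and the equality test here uses linearHypothesis() from the car package, which is one way to do it and not necessarily the way the lab script does it.

library(car)                          # for linearHypothesis()

mod <- lm(y ~ x1 + x2, data = dat)    # dat, y, x1, x2 are made-up names

summary(mod)                          # t-tests: is each slope statistically different from zero?
confint(mod, level = 0.95)            # confidence intervals for the same coefficients

# Test whether two slope coefficients are equal (H0: beta_x1 = beta_x2)
linearHypothesis(mod, "x1 = x2")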
And then there's the ANOVA test, which you have dealt with for months now and will see again throughout your career. It's a workhorse; it ranks right up there with OLS regression in that respect, so you'll see it all the time, and you're going to see it this week. Some of the things to look for this week: you're going to compute standard errors using that variance-covariance matrix of the estimators. This is, again, something you want to understand how to do statistically. You want to understand where it's coming from; it will make you a better consumer of other people's work and it will help you in doing your own research. Computing the t-statistic is really important. Getting the standard errors from the variance-covariance matrix puts you at a level you really want to aspire to, but it's not the kind of thing you're necessarily going to do day to day. Understanding where the t-statistic comes from, on the other hand, is the kind of thing that should be in the back of your mind every time you open a journal article and see a regression table. One reason is that, depending on the journal, the format of those regression tables may not include the t-statistic, and you want to be able to look quickly at the coefficients and the other things in the table and say, hmm, that t-statistic is or isn't significant, so that you're not just looking for stars. Typically in those journals, if the t-statistic isn't reported underneath the coefficient, something else is, and that something else is what you use to derive the t-statistic. The math is simple enough that, if you think about how t converges to z, which is something we covered last semester, you typically don't even have to get out a calculator: you can look at the two numbers and say, oh yeah, the coefficient is big enough relative to its standard error that the ratio is greater than two. The other thing you're going to look at is computing R squared and adjusted R squared. We went through this last week. R squared is important, and adjusted R squared is equally important, because it adjusts for the fact that, as a model becomes more complex, you expect R squared to improve just because you're throwing more stuff in. You don't want to end up with one of those kitchen sink regressions that has a huge R squared just because you included all the data with no theory behind it, and maybe you're actually overfitting. You may have something where, because you included absolutely all of the data, of course everything is explained, but it may not have any meaning at all once you get outside the current sample. That's why adjusted R squared is so important. You're going to see the formula that shows the relationship between those two, and while you don't need to memorize it, you do have to understand the relationship, again, to be a good consumer of other people's work. Finally, look for the relationship between the F statistic and the t-test. It's a pretty simple one; you should understand it conceptually, and you should also understand the really simple mathematical way you get from one to the other. So, going through last week's survey: question one everyone got right, unless you did each other's work, in which case I would not know, because it was just your name. Question two is one of those getting-to-know-you kinds of things, because we're not actually in class together. It's the kind of thing that would happen in discussion before class.
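As a rough sketch of where those numbers come from, again using the placeholder model from above rather than the lab's actual script:

V  <- vcov(mod)                       # variance-covariance matrix of the estimators
se <- sqrt(diag(V))                   # standard errors: square roots of the diagonal

b     <- coef(mod)
tstat <- b / se                       # t-statistic = coefficient / standard error

summary(mod)$r.squared                # R squared
summary(mod)$adj.r.squared            # adjusted R squared

# The relationship between the two: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1)
n <- nobs(mod)
k <- length(coef(mod)) - 1            # number of slope coefficients
1 - (1 - summary(mod)$r.squared) * (n - 1) / (n - k - 1)

# For a single coefficient, the F statistic for that one restriction is just the t-statistic squared
tstat["x1"]^2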
Although I will add that when I took this course, this is how Adam took attendance every week: he handed around an attendance page and had everybody fill out these kinds of questions. So, favorite foods. Mine is pizza, because it can be anything; it can be everything from an appetizer to a snack to a main course to even a dessert. You can have some pretty good dessert pizzas. You can also throw pineapple on a main course pizza; I won't hate you, and I will confess that I think it can be pretty good. We had several votes for sushi and sashimi, which I agree are pretty good. We had various pasta and noodle dishes, and a couple of votes for pizza, so that's good. No extra points, but I agree. The one that stumped me was donburi. I'm not sure exactly how you pronounce it, but it is a Japanese dish I had never heard of. I said noodle dish at first; it's not, it's a rice dish. I had to look it up, and it reminded me of poke bowls, but I'm sure I have that completely wrong, and since I've never had it, I hope that when I try it at some point I'm surprised and find something completely different. So, on to actual work. Question three was just your questions, so I'll cover that later. Question four asked what the numbers in a particular line of code mean, and also what kind of matrix you were making. Everyone got that it was a correlation matrix; the key there is the cor in the function name. The numbers inside the brackets represent the columns, or variables. I was looking for columns, but variables is right. The reason I was looking for the word columns is that it's important to remember that relationship in R, and this is not a statistics thing, this is a coding thing: when you are looking for the variables, you want the columns, and rows are observations. Almost every dataset you work with is already going to be set up so that the rows are the unit of observation. We covered this a little bit last semester: the rows are the unit of observation and the columns are the variables. So if you want to get the variable for a particular observation, you need the row number and the column number. In this case, we wanted every observation from each of these columns. There are going to be some exceptions; you will get some datasets where the observations go across the columns and the variables go across the rows, and one of the very first things you're going to do when you run across those is a thing called reshaping, where you basically flip it so that it's the way everybody normally works with it, which is rows as observations. Question five just asked for a couple of statistics: the R squared was 0.707 and the residual standard error was 10.3. I think everybody got this, so not much to say there. Question six I wasn't very specific about, so some of you gave me actual numbers, but I just really wanted to know the direction and significance of the change from the base-plus-one model to the base-plus-two model, which added the reduced-price meal percent: what happened to the coefficient on the percent of people receiving CalWORKs, I think it's called, the California welfare program. The effect of the California welfare program got smaller and it lost significance. And going into question seven, we're going to see why.
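Here is the indexing convention in miniature, with a made-up data frame and column positions; the lab's actual line uses its own numbers.

dat[3, 5]                 # row 3 is one observation, column 5 is one variable
dat[ , c(2, 4, 6)]        # all rows (observations) for columns 2, 4, and 6 (variables)

cor(dat[ , c(2, 4, 6)])   # correlation matrix of those three variables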
So there are two related reasons for that change: one based in theory, one based in statistics. Why is there a relationship between the California welfare program percentage and the reduced-price meal percentage, and what are the statistical and theoretical reasons? And as far as the statistical reason, you can summarize it with a sentence something like such-and-such, and fill in the blank. From the theory standpoint, the reason is that welfare participation and reduced-price school lunches are both based on family income, so they're both related to economic status. The statistical result of that is that they're highly correlated, and correlated was the fill-in-the-blank part. The statistical effect is something we should expect given the theory. This goes back to the big point from all of last semester: know your theory. If you were building this model, you probably would not have wanted to put reduced-price school lunches in if you were already working with welfare participation, because you should have expected this. You might have run the correlation just to see, and if you found out they weren't correlated, you might have made a big note and written, wow, this is a question I've got to answer later, but you probably would not have wanted to put the two of them in the model together. You should have expected some multicollinearity if you did, so it's probably not something you should ever have done. And of course, it's not your fault you did it here, because the lab told you to. Question eight goes into that multicollinearity. Multicollinearity is often a result of too little (blank) in the explanatory variables in a small sample, and the word I was looking for was variation. You could have said movement; anything that was a variation on variation or movement is okay. If you said variance, okay, you still got credit, but I would not say variance, and the only reason is that variance has a specific statistical meaning. So even though the English word kind of fits in there, I would avoid it and go with variation. You can see more on this in section 3.4 in Wooldridge and section D in the worksheet. I want to make sure you understand this is a non-trivial point. It's been non-trivial for me in my own research; I have ongoing research where I don't have significance because of multicollinearity and small sample size. One of the other ways you can run into something similar is when you have rare events, and if you have a small sample and rare events, then you are just doubly out of luck on this. If you're doing international relations or comparative politics, I can tell you for certain that at some point this will be a non-trivial point you will have to consider. If you're doing American politics, I expect it will be too. If you are a theory person, you may not be doing a lot of statistical work, and you are coming at things from a totally different perspective, so this kind of thing may not be as important. But then again, it may, because more and more theory people are trying to look at things like relationships among words, and they're using statistical techniques to do so, some of them even more advanced. So you may run into this there as well; I actually wouldn't be surprised if you did.
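To illustrate that pattern with placeholder names rather than the actual lab variables, the check and the comparison would look roughly like this:

cor(dat$welfarepct, dat$mealpct)                 # a high correlation warns of multicollinearity

m1 <- lm(y ~ welfarepct, data = dat)             # base plus one
m2 <- lm(y ~ welfarepct + mealpct, data = dat)   # base plus two: add the correlated regressor
summary(m1)
summary(m2)   # the welfare coefficient shrinks and loses significance, as in questions six and seven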
If you start trying to apply statistical techniques to text, I would expect that to be the case there as well. I don't know enough about that to say for sure, but I really wouldn't be surprised. So, question nine: I didn't even write down what it was, but nobody had any problems with it. Question 10, again, nobody really had any problems with; it was just what the three counties in the subset were, and it was Los Angeles, Orange, and Ventura. I will say on question nine, if anybody had a question about it, just get hold of me and we can go over it. The extra credit question, from the worksheet: there were two different numbers, the summary of the residuals from the final regression, where the mean residual should equal zero, but then you get a slightly different result for the standard deviation of the residuals versus the residual standard error. Do you know why? This is an answer from someone in the class; I won't name names because I don't have permission, but it is a great answer. The reason is the very small difference in the denominator inside the square root: the SD formula divides by n - 1, while the RSE formula divides by n - k (n - 2 in the simple regression case). Several of you talked about how it was because the model was very well fitted. I think the well-fitted part explains the word slightly, that is, why the difference is so slight, and to that extent that's right: the reason the difference is small is because the model fits well. But what we were really looking for was why there was a difference at all, and this answer explains that. So if you had a different answer, what I would encourage you to do, without spending a lot of time on it, is to at least consider whether the fit of the model explains what we're talking about. That is, would a model with the same fit still produce this difference if it weren't for the difference in the equations? Again, this is a pretty good explanation, so think about it and then maybe look at the equations. Not something you need to spend a lot of time on, but that's the answer there. Now I'm going to go into the issues with the last lab. There were a couple of script issues. I'm pretty sure I've gone over this one individually with the people who were still having problems, but in case you are still having problems: the here package is supposed to make things easier than having to worry about working directories and getting your data in the right place and all of that, because all of the scripts for the rest of the semester are going to be set up to use it, and as long as you pull the latest from GitHub you'll be able to run the scripts as is and you shouldn't get errors. But you do have to run one piece of code. It's a single line; it was in the week three script, but it was commented out. You only have to run it once, and you really don't want to run it every time, which is why I commented it out: I didn't want to run it every time I went back through to redo something in that script and to make sure everything was working right. All you have to do is go back, delete that hashtag or pound sign, whatever you want to call it, highlight that line, and run it, and then you won't need it again unless you reinstall R. Do that, rerun the lab three script, and you'll be all set.
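Assuming the commented-out line is the standard install call for the here package, which is what the library command in the next part suggests, it looks roughly like this:

# install.packages("here")   # uncomment and run this once, then leave it commented out again
library(here)                # this line stays in every lab script that needs the package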
There is a line in each lab's code that will call up that library, so you will have to have it installed or else you'll get a different error, but that line will be in every lab that needs it, and it's just the library(here) command. The other issue was with attaching data; the data in this case was the CADTA data. On attaching data frames: first of all, to the person who had this, if this was causing an error, it's a good catch figuring out that this was the fix, and I want to say that to start with, because the person solved it for themselves. But this also should already have been taken care of in the models, so that attaching the data wasn't necessary. It may be that the person saw this in the script, thought it was going to be a problem, and added in attach(), in which case, again, that's a good catch, because it wasn't there; but it actually shouldn't have been necessary, and there's a reason it wasn't there. I want to talk about that reason, because it's kind of important. Last semester and for some of this semester we have used attached data frames, but you've probably noticed at various points that you get this message. It's not really an error message, it's a warning message that says such-and-such is masked from whatever. Masked is just an R term meaning that more than one object has the same name coming from two different sources. It may be that two different packages use the same name for something, or that you're working with two different data frames that have the same name for something. For example, a lot of data frames use the variable year, so if you load two data frames that both have a year variable, you're going to get that warning, year is masked from whatever, when you attach the second one. The issue is this: if you have two data frames with a variable named x and you attach both of them, then when you go to run something, R has to decide where to get x from. If the variable is year, it will probably work out okay, but it might not. Most likely R will still run whatever commands you give, but you won't know where x is coming from: this data frame, that data frame, or some function out in a package. So the preferred method is to call the data frame directly, and this can be done several ways. If you're working with a single variable, there's the dollar-sign method: you put the data frame, a dollar sign, and the variable name. You don't have to attach anything; this calls it directly. The way it was done in this lab is kind of ambiguous, and looking at the code you might not have realized it was even being done: after the comma there's CADTA, and that's the data frame name. That line actually means the same as the dollar-sign line, but the third version, which names the data explicitly, is what I would say is the preferred method, and the CADTA may need to be in quotes; R will let you know if it does or not, and you'll get a pretty obvious error if it's in quotes and doesn't need to be, or the reverse. I'm going to say this is the preferred method when you run a model, because it's not ambiguous. Someone looking at your code can immediately see: this is the data, this is where the data is coming from.
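Side by side, the three approaches look like this, with y and x as placeholder variable names:

attach(CADTA)
m <- lm(y ~ x)                   # works, but risks masking if another object also uses the names y or x

m <- lm(CADTA$y ~ CADTA$x)       # dollar-sign method: no attach needed

m <- lm(y ~ x, data = CADTA)     # preferred: the data source is named explicitly and unambiguously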
Someone looking at that line, if they understand the specification of the lm function, may see that the first thing after the comma is supposed to be the data and conclude that's the data, but they may also look at it and go, what does CADTA mean? They may go Googling, trying to figure out what the heck the function is doing, and spend half an hour trying to figure out your code. So the explicit version is preferred. The other thing is that because it is more precise, you're less likely to make a coding error. So for both style reasons and for avoiding errors, I would recommend that method. If you do it this way, you don't have to attach the data frame. The one thing you will miss if you don't attach the data frame is the autocomplete: if it's attached, you can start typing A, V, G, I, N, C, and by about the G it will bring up the list and you can just click to get avginc, and the same for the other variables. So you'll have to type your variable names out all the way, which is a minor inconvenience compared to producing bad results. This is the preferred method. Coming up in this lab, some things to look for. First of all, you're going to be computing some of these statistics multiple ways. This is important both conceptually, in terms of statistics, and for coding reasons. Sometimes one way is simpler than another, and in other cases, computing the same statistic with a different dataset, it may be easier to use the other method. So it's important to have multiple ways. If you have more than one way to do something, it's like having a Swiss Army knife instead of a pocket knife with a single blade: you can still turn a screw with the single-blade pocket knife, but the multi-tool with the screwdriver makes life a lot simpler. The second thing to look for is how to address specific model elements. What I mean by that is, when you have a model you've run, like the regression models in this case, once it has been run it's out there in the environment; it's saved, all of the results are out there, and you can call them. If you want a specific coefficient, you can call it. If you want the standard error for a specific coefficient, you can call that. If you want the intercept, you can call that. If you want the adjusted R squared for the model, bam, you can get that. You can take those things and feed them into functions, save them as objects, or create a new data frame that holds specific pieces of the model. You can create a variable in a data frame with things from one model and another variable with things from a different model so that you can compare the two. Knowing how to pull those things out, especially as you get into more advanced work, or if you start doing replication projects of other people's work where they have done this, means you're going to be a lot less lost if you have some idea of how it's done.
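A hedged sketch of what pulling those elements out looks like, using the placeholder model from earlier; the exact names in your own model objects may differ:

coef(m)["x"]                                   # one specific coefficient
coef(m)["(Intercept)"]                         # the intercept
summary(m)$coefficients["x", "Std. Error"]     # the standard error for one specific coefficient
summary(m)$adj.r.squared                       # the adjusted R squared for the model

# Saved pieces can go into a data frame, for example to line up estimates and standard errors
results <- data.frame(est = coef(m), se = sqrt(diag(vcov(m))))
results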
You're still probably going to have to go through and really think through what they're doing, but you'll have an idea of what those things are and a base to start working from, so this really is going to help. And I have to tell you, going through preparing this script and this presentation for you, I started thinking about places where I've used this and where I wished I'd had a little better grasp of it at the time I started messing with it, so it really can be useful. One of the topics for this week's lecture is mediation analysis, and there's a package that will do mediation analysis for you. It will work with regular logit models run with glm, which we actually used at the end of last semester, so I'm not getting way ahead. But it won't work with a panel GLM model, pglm, because that model object isn't set up the same way, so in order to make it work with the mediation package I would have to go in, pull the model elements individually, create a data frame, and feed them into the mediation package. The workaround I used in that case was to do something else so that I could use glm, but I could have pulled the elements out of that pglm object and fed them into mediation by doing this. So again, that comes back to the idea that it's good to have more than one tool in your toolbox; it might actually have been quicker for me to do it this way, and in the future that might be my go-to approach. We're also going to continue addressing specific rows and columns: specific rows, I'm going to say it again, are units of observation, and specific columns are variables. If you want a specific variable for a specific observation, you use the row number, comma, column number format. So we're going to continue doing that. Finally, a big thing you're going to do here is subset to eliminate NA results. As you work with bigger datasets that are available either for free or commercially, replication data or the big datasets that are out there like the V-Dem data or the ANES data, they're going to have a lot of places where there's just no value, and R will fill in NA when you import the data. Those NAs will cause R to throw a lot of errors when you run different packages. Some packages will absolutely hate them; for others you'll be able to set an NA option, such as an na.action, that lets the function ignore them, but sometimes just ignoring them isn't what you want to do. So step one in all of this is figuring out how to eliminate observations that have NAs, and one of the simple ways to do that will be covered this week; it's one of those things you want to have in your little multi-tool, and there's a quick sketch of the simplest version at the very end of these notes. Later on, you're going to learn other ways of dealing with NA results, some better for some circumstances than this, but this is the base you start with. So that is basically it. I will see you all March the eighth for the next live class.
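And here is that quick NA sketch, with a made-up data frame and variable names; the lab will walk you through its own version:

clean <- na.omit(dat)                          # drop any row that has an NA in any column

keep  <- complete.cases(dat[ , c("y", "x1")])  # TRUE where the listed variables are not NA
clean <- dat[keep, ]                           # drop rows with NAs only in those variables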