 Hi, in this video, we're going to walk through an example of one way ANOVA. And in this analysis, the example we'll use comes from research conducted by people in my own department. That's the Department of Energy and Mineral Engineering here at Penn State. So the researchers here predict the volume of emissions that will come out of PA-based facilities under three different policy scenarios. Scenario one is the base case, which is that PA does not join the regional greenhouse gas initiative, which is essentially a cap-and-trade program, and keeps the existing alternative energy portfolio standard, which mandates a certain amount of energy because it's come from alternative energy sources. Our second scenario is that PA joins RGGI and keeps the portfolio standard. And then the third scenario is that the PA joins RGGI, but drops the portfolio standard. It should be noted that this research was conducted in 2021 before Pennsylvania joined RGGI in the summer of 2022. So let's get down to the analysis. So our hypotheses, which we'll always have with ANOVA, are that all three means across all three groups are equal. So our three groups are the three different policy scenarios. And our quantitative variable here is the volume of emissions. To simplify things, we're just going to look at the volume of nitrous oxide, but do bear in mind that the dataset does contain other pollutants as well if you want to look at those. So we're saying that the volume of nitrous oxide, on average, is equal across the base case scenario, RGGI, and no AEPS. The alternative is that at least one of these means is not equal from the others. So let's get down to it. Let's see what the data say. So first thing's first is to import the data, and I've already done that with these familiar commands here. One thing that should be noted is that we've got three distinct Excel sheets. So we're going to read directly from Excel using read underscore Excel, this function from Pandas. We can do that, which is nice. Another thing too is to simplify this analysis, we're only going to look at the year 2030. So if you open up one of these Excel files, you'll note that it's got multiple spreadsheets by year, and so we're just going to pull out the sheet for year 2030. So we've imported these three Excel sheets as data frames called base, RGGI, and no AEPS. Let's examine one. Let's just take a look at base here, and so here's our data. Now the task at hand for data processing is that we need to merge all three of these data frames. So if we want to compare these three, it's helpful to have them in one data frame so that we can make a fair comparison using our statistical tools here. So first things first is we need to, if we're going to merge these, we'll go back and use our merge function that we learned about in lesson two, and we need to identify some keys for doing this merging. And so we can tell the merge function what our key variables are by setting them as indices. So we'll use this set index command to define the key variables, and these are the variables that uniquely identify the rows here. So we need to provide a list here, and the variables that do that, if you go and inspect the data set, is the unit unique ID, the latitude, the longitude, the state, the county, and the field type, okay? And we'll use these same ones. So all three of those data frames are structured the same and have the same variables. So we'll do this for all three data frames that we have here, and just change these names, so our GGI, and no A E B S, okay? So all those, the set index function is doing is designating these variables as indices. Another thing that we'll do to set up the data before doing the merging is to add a suffix to the variables in the data frames. And what this will allow us to do is distinguish the variables that are in base from the variables in our GGI and from the variables in no A E P S, because these variables are all currently called the same thing because they're in separate data frames. But when we merge them together, we don't want to get confused about which column belongs to which. So add a suffix to just make sure that they're distinct. So use this add suffix command and to base, we'll add the suffix underscore base. And same thing with our GGI and no A E P S, except we want to change these suffices to our GGI and no A E P S, okay? And then we can do the merging. So we'll make a new data frame. Let's call it DF underscore merged and for data frame that is merged. And from the pandas library, we'll use our merge function and we can't merge all three at once. So we'll merge to have into one data frame and then this merge data frame will merge with the third with the third item. So let's first start off by just merging base in our GGI, our left index here, say left index is true and right index is true just to indicate that we're using the indices here and we want to keep everything. So we'll say how is equal to outer. So we're going to do an outer merge and we're going to keep everything. And then once these two are merged, what we'll do is we'll still have, we'll overwrite our DF merged again using the merge command here and we'll use all of the same settings except instead of base, we're going to use DF merged. So we're going to first merge base in our GGI store that in DF merged and then DF merged is going to be used in another merge command to attach not our GGI, but the no A E P S data frame. Okay. And then we'll visualize how that plays out. Okay, so we'll go ahead and run this and there we go. So note that these items that are in bold, these are the indices. So from setting the index, these are all in bold here. And then we've got and then we've got all our variables. So the different pollutants, here's NOX that we'll be looking at and note that we've attached the underscore base to distinguish that from the NOX that's currently hidden by this dot, dot, dot by the NOX for underscore RGGI and no APS. Okay. So that, that all worked. Well, and you might notice that, you know, you might be wondering, oh, why did we include latitude, longitude, state, county and fuel type in the indices? But even though that they're not really keys, that we've got, you know, we've got the same values across these first five entries, for example, for those variables, whereas really unique, unique ID is, is really the key here in that it's, you know, the unique identifier for each, each row. Well, we, it doesn't hurt to include these variables in the keys and, and essentially retain them for, for the items. These are properties of, of the individual units. The units here are, are basically power plants, factories and things like that. And so we can retain these variables by, by including them in the indices without duplicating them across all the different data frames that are merged in here. So anyway, a little bit about that. So let's move on. So once we've done that, let's take those variables that we have set as indices and bring them back as variables. So we'll reset in index, this reset index command will just turn these variables that were currently indices again into variables. So let's check that out. There we go. That worked. They're not bold anymore. We have these row numbers as bold. Those are indices now. Okay. So that's all well and good. We've got those three data sets merged. Now the other thing that we want to do is we want to format our data set so that we've got one quantitative variable and one categorical variable. Now the categorical variable is going to contain our groups. So it's going to, each category will represent one of the policy scenarios here, whereas the quantitative variable will contain our variable of interest. The thing that we're going to calculate the mean upon, which will be the emissions in this case, for this particular example, the NOX emissions. So in order to create that format, we're going to use the melt function here. And again, you can go back to the lesson two or reference the pandas cheat sheet to learn about the basics of the melt function. But with this melt function, our ID vars, whatever we specify an ID vars is going to be left alone. And so here we'll just specify unit unique ID. So we'll keep that as is. Now everything that we specify in value vars here, these are going to be called names that are going to be turned into a new variable. And so this is going to include our NOX base, our NOX RGGI. And our NOX NOA EPS. So these three columns, these are the column names. These are going to get turned into a new variable here. And we're going to call that variable pollutant, we call it policy or whatever we want, just call it pollutant here. And our value name is going to be emissions, because whatever we specify, basically all the values that are under these three columns are even going to get put into a new column that we'll call emissions here. But again, we could call us whatever. So let's see this in action. Go ahead and run this. And there we go. So it's taken again, these three columns out of the original data frame above, put those into the pollutant column here. So NOX base, NOX RGGI, NOA EPS. And then it's taken all the variables that are under each of these and put them in the right columns here based upon where the unique ID falls. Okay. So we've got that nicely arranged. Now one thing that bugs me here is that we've got a pollutant variable and it starts with NOX and that's fine. We could, you know, this adequately describes the policy scenario, but one nice touch would be to separate out the different policy names here. So let's just do that real quick. So what we'll do is we'll make, we'll keep our pollutant column, but we're going to change this a little bit. So we're going to overwrite it and we're going to make a new column called policy. And so what we're going to do is we're going to take each of these entries and separate out the first part, which is the pollutant, put that in the pollutant column. And the second part after the underscore, we'll put in the policy column. And in order to do that, we're going to use the string or str split function. So we'll call our current pollutant column, say dot str dot split. And we'll tell it what thing to split upon and that is the underscore. So you want to split upon the underscore and another argument, we're going to give it here is expand equals true, which is basically to take. So after splitting on this underscore, we'll have two strings and this will expand it into two separate items, not one single string that contains everything but the underscore. So let's see how that plays out. There we go. Now we have we celebrate a pollutant column and it just contains the pollutant type, in this case, Nox, because that's all we're looking at here and now we have a policy column that's got the policy names. And these are our three policy scenarios. So base, the dot dot dot is covering up the RGGI, but it's there and then our no AEPS. Now, the third thing that we want to do is you might notice that we have some NAN values here. Let's just get rid of those. This is important to do because our one way ANOVA test won't work if we have NAN values. So our NANs are just in our emissions column and really that's the only thing we need to remove the NAN values from because that's the variable that we're going to plug into our or that the one way ANOVA will mainly operate on. And we only have these NANs under the no AEPS scenario. And you can get that by inspection of the dataset here. I won't waste time doing it here. And we'll just use our drop NA function to remove those NA values. Okay, so there we go. Now another good thing to do is to do some exploratory data analysis. So here we'll look at the mean values in each of our groups. And so let's make a new data frame where we just kind of clean this up a little bit. We'll drop some labels here. So our unit unique ID we'll get rid of. And actually we don't need to keep pollutant either because it's NOX for everything. We don't need the unit unique IDs because we don't need to distinguish the different units from one another in the subsequent analysis. So we're just going to get rid of those labels just because they're currently unnecessary and we'll do our then we'll do our group by and we're going to group by policy and we'll calculate the mean within each group. So we're going to calculate the mean emissions within each type of policy. Let's just rename some columns here because once we do this this group mean we'll have a column called emissions. That would really be much better called mean. And then just one other thing that we'll need to do in here is reset the index after doing the group fire. So let's see what we get as a result of this. There we go. So we've grouped by policy. We have our three different policy groups. And we've got the mean NOX emissions within each group here. And certainly these mean values are different from one another. So none of them are equal. But the real question here which we'll answer with the ANOVA test is whether these differences are significant or not. So do we have a statistically significant difference of means here? So let's go ahead and do that test. So we'll use the built-in function or I should say the function that's available from the sci-pi.stats package here. And this is just stats.f underscore one-way. The F is referring to the F statistic that's inherent to the ANOVA test. So that's why it's called that name. You don't see ANOVA in the test. It's referring to F test which is again in reference to that F statistic. And so we want to give it the three series of values from our quantitative variable in this case emissions. And we need to pull these values out of each policy. So we need to do a bit of a filtering statement here. And so first case is under the policy that's equal to base. And we can include our, just for safety's sake, we'll include our drop NA here. And just copy and paste this for our other scenarios too. And we'll keep the drop NA in there. It doesn't hurt. So even though there aren't any NA's in these other two, it won't affect them. Just won't drop anything. But we'll certainly get rid of the NA's from this no APS here. Okay. So there we go. We're supplying the emissions values for the base case. And then separated with the comma from the emissions values from RGGI case. Another column in our third argument is emissions from the no APS case. Okay. So let's go ahead and run this. We run it and we get some output. We get one line of output here. We've got our statistic. So this is the F statistic value and we have our P value. And you might note that this P value is pretty high. It's much higher than 0.05. So our default significance level. So we fail to reject the null hypothesis. So our conclusion here is we fail to reject the null hypothesis at the 0.05 significance level. Exclamation point, capital W. And so what does this mean? It means that these data do not support the conclusion that any policy scenario leads to different mean emissions from the other scenarios. So with respect to just the mean emissions, there's no significant difference. This is interesting. Clearly the base case is quite a bit higher than the other two. And you would think that adopting RGGI would lead to a significant difference. But this test is saying that's just not true. In order to illuminate this conclusion a little bit better, let's visualize it. This will make a clear explanation here. So in order to do this, though, we need to calculate some error bars. So bear with me here. So I'll make a new data frame. We're going to take our DF2. We're going to group our DF2, by the way, as what we have up here after dropping these variables. And we're going to group by our policy again, except instead of doing the mean, we're going to calculate our standard deviation on the values. And we'll rename some columns here, because again it's going to call it emissions. When in fact it doesn't have the emissions, it has the standard deviation of the values there. All right, so let's do that. And then we need another summary statistic here. So I'm going to borrow the same structure. We're still going to group by policy, except instead of doing standard deviation, we're going to count. That is, we're going to count the values in each policy group. And again, we need to rename the emissions. And we'll rename this instead of a CD, we'll rename it count. And then we'll do some more merging. So our new DF3, which has the means in it already. So this is our DF3 currently. We're going to add on the results of these other group summaries. So PD merge, DF3, we're going to tack on our DF4. And we're going to merge on policy. So policy will be our key here, because that's really making each row distinct. And then similarly, once we tack on DF4, we'll also then merge on DF5. And then we need to do one final calculation here. And that is, we need to find our standard error, capital E. And this is equal to our standard deviation divided by the square root of our sample size, or our count. And this formula is coming. You can go back to the lesson six to see where this formula is coming from. Basically, you can see the traditional inference stuff there on competence intervals and where this comes from. So let's see what this looks like now. There we go. We've got our mean, our standard deviation, our count, and our standard error. All right. Finally, let's make a nice plot here. So we'll use plot nine, we'll import everything. And we're going to make a ggplot based upon this DF3 info. And what I would like to do is I would like to make a bar plot that's showing, that's breaking up our values by policy and plotting the mean for each policy. Now here, we just have a single value for each policy. So what we need to do is we need to set that or statistic to identity, meaning that it's just going to use this single value for the height of the bar. You can take a look at that. There we go. There's our bar plot showing that base has a bit higher emissions. These two are pretty close. But that's just showing what we calculated already before we did the hypothesis test. Let's add in some error bars because it's the variance about these means. That's the critical information. And that's why it's called analysis of variance, is because we want to see how much these mean values could vary. So again, we'll have an error bar for each policy. And we need to specify a minimum, so low level of the error bar. And this will be our mean minus two times our standard error based upon our going back to confidence intervals. So we have our statistic plus or minus two times our standard error to give us a 95% confidence interval. And so that would mean then that our y max or the top of the error bar should be plus mean plus two times standard error. Okay. So let's see what this looks like. Oh, one other just aesthetic thing here is that we can specify width of the error bars. Let's say 0.2 should look nice, but we can adjust that if it doesn't. And there we go. We've got our error bars. So the error bars, these are showing our 95% confidence interval for the mean and otherwise showing how much this mean could vary. But then a reasonable realm of possibilities. And so interpreting our hypothesis test result that was much larger than 0.05 and where we failed to reject the null. Well, we see that the mean of each one falls within the 95% confidence interval of all the other bars. So that is considering the variance of the means, these overlap significantly. There's no real big significant difference. And so that's the critical aspect to ANOVA is that you need to take into account the variability of the mean and not just the single mean value itself. Okay. Thank you. And that's it.