Hi, in this video, we're going to take a look at the underlying data for the EPA's FLIGHT tool. These are data on greenhouse gas emissions from facilities all across the United States. By facilities, I mean power plants, waste processing centers, factories, and so on. I highly recommend you click on this FLIGHT tool link to explore the database in another format. If you click on that, you'll go to their web app. So this is the FLIGHT tool here, and it's a map-based overview of the data. You can zoom in on different localities and see where these facilities are located and information about them. You can also filter by different facilities and variables. So go ahead and check that out.
However, what we're going to do here is look at the underlying data. We're going to do our own thing with it outside of the web app, and we are ultimately going to find which state has the largest amount of emissions from facilities. So this is not emissions from the automotive sector or anything else, just from facilities. Along the way, we're going to look at histograms, we're going to see how to calculate the mean and the median, and then ultimately how to summarize all the emissions within each state. So let's get to it.
We're all familiar with importing the data. We've got those lines of code here, except there's one difference that you may not be familiar with, and that is the skiprows argument. So why do I have skiprows here? Let me explain. If we take a look at the underlying CSV, our data frame, our table, doesn't really start until row four, with the column headers and then all the data below. We've got three superfluous lines at the top that really aren't part of the data frame at all and that we don't want to include. So we just need to skip those three rows, and pandas makes this incredibly easy. We say skiprows equals three, and we skip over those unnecessary lines.
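
As a rough sketch of the import step just described, the snippet below reads a small made-up CSV whose first three lines are not part of the table. The file contents, the preamble text, and the exact column spellings are assumptions for illustration only; the real download will differ.

```python
import io
import pandas as pd

# Made-up stand-in for the downloaded file: three non-data lines,
# then the header row, then the data. None of these values are real.
raw_csv = """Hypothetical title line
Hypothetical description line
Hypothetical export note
Facility Id,Facility Name,State,Total reported direct emissions
1,Plant A,TX,1200000
2,Landfill B,OH,45000
3,Factory C,TX,0
"""

# skiprows=3 drops the three leading lines, so the fourth line
# is read as the column headers.
df = pd.read_csv(io.StringIO(raw_csv), skiprows=3)
print(df.head())
```

With a real file you would pass its path in place of the in-memory buffer and keep skiprows set to however many leading lines the file actually has.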
And this highlights the importance of taking a look at the CSV file before importing it, just to understand exactly what is in the data file. You can open CSV files in Google Sheets or Microsoft Excel.
So once we have this imported, let's go ahead and take a look at the size of this data set. One handy attribute for this is shape. All shape gives us is the number of rows in the data set and the number of columns. So we have 15,386 or so rows, where the rows are different facilities, and we've got 66 columns. Furthermore, we should take a look at what the columns are exactly. So let's print out the column names here. You've got facility ID, facility name, city, state, et cetera. One really important variable here is total reported direct emissions, so we're going to focus on that, although there are a lot of other interesting things here too, in particular the industry type, that is, what form of industry is producing these emissions.
So let's get down to it. Let's start by making the histogram. With a histogram, we want to make sure that our visualization spans from the minimum of the data to the maximum of the data; at the least, it needs to cover the span of the data. So we want to find the min and the max. Let's just say GHG min, so we're going to find our minimum total reported direct emissions. We just use the min function. Very easy. We'll do the same thing for the max. And let's print these results. We'll say min is GHG min and max is GHG max. Our direct emission values for all the facilities span from zero to some really large number here, more than 20 million. So let's see what happens when we make a histogram of this. We'll call our data frame, and one easy way to make a histogram is through pandas: we just say dot hist.
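
The inspection steps so far (shape, column names, min and max) might look something like the sketch below. It uses a tiny made-up data frame rather than the EPA file, and the column spelling and the variable names ghg_min and ghg_max are assumptions based on how they are spoken in the video.

```python
import pandas as pd

# Toy data for illustration only; these are not real EPA values.
df = pd.DataFrame({
    "Facility Name": ["Plant A", "Landfill B", "Factory C", "Plant D"],
    "State": ["TX", "OH", "TX", "CA"],
    "Total reported direct emissions": [1_200_000, 45_000, 0, 20_500_000],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = df.shape
print(n_rows, "rows and", n_cols, "columns")

# List the column names
print(df.columns.tolist())

# Minimum and maximum of the variable of interest
ghg = df["Total reported direct emissions"]
ghg_min = ghg.min()
ghg_max = ghg.max()
print("min is", ghg_min, "and max is", ghg_max)
```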
And we give it the column that we want to make the histogram of, in this case total reported direct emissions, and we'll run this. And here we go, we get our histogram. What it has done is use some predefined rules of thumb to make adequate bin sizes for our histogram. So the histogram is counting the values that fall within specified ranges and then plotting those frequencies as bars. This is essentially a frequency table represented graphically, where the categories are just different ranges of the quantitative variable.
However, when we use the default method here, we don't know what those ranges are, and that's a little bit problematic. We're not sure what the breakpoint is between this first bin and the second bin. We don't know where this one stops. So it's good practice to manually specify the bins. We could do this fairly crudely and say zero, then 1e7, 1e7 being 10 million, up to 20 million. Oh, but then we have to go up to 30 million, because our largest value is slightly more than 20 million. So we'll have three bins, and we've got four values specifying the breakpoints between those three bins. We visualize this and we see a pretty crude histogram. I say this is crude because we have pretty much all our values in this first bin, then a few values in the second bin, and probably one value in this third bin. So these bins are too fat; we don't have good enough resolution.
So we could break this up some more; we could subdivide it. The tool here is the range function. The built-in range will just give us a sequence of values from some integer to some other integer. So let's say from zero to 21 million, by some step; let's say we want to go by one million. And we get the whole list of values, going up in that increment.
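
To make the binning concrete, here is a minimal sketch that builds the crude three-bin edges and the finer one-million-wide edges, then counts how many values land in each bin. The emissions values are made up, and numpy's histogram function is used here only to show the counts without drawing anything; the same list of edges can be passed as bins to the data frame's hist method to get the plot. One thing to watch, and a choice made in this sketch rather than in the video: range excludes its stop value, so the stop here is set to 22 million to put the last edge at 21 million, which is above the toy maximum.

```python
import numpy as np
import pandas as pd

# Toy emissions values for illustration only (not real EPA data).
ghg = pd.Series(
    [0, 45_000, 300_000, 1_200_000, 2_500_000, 9_900_000, 20_500_000],
    name="Total reported direct emissions",
)

# Crude version: four breakpoints define three wide bins.
coarse_bins = [0, 1e7, 2e7, 3e7]

# Finer version: edges every 1 million. range() stops before its
# stop value, so the last edge produced here is 21,000,000.
bins = list(range(0, 22_000_000, 1_000_000))

# Count the values falling in each bin (what the histogram bars show).
coarse_counts, _ = np.histogram(ghg, bins=coarse_bins)
fine_counts, _ = np.histogram(ghg, bins=bins)
print(coarse_counts)
print(fine_counts)

# To draw the histogram with these edges in pandas (needs matplotlib):
# ghg.to_frame().hist(column="Total reported direct emissions", bins=bins)
```

Values that fall outside the first and last edges are left out of the counts, so it is worth checking the edges against the min and max found earlier.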
So this is a handy way to build these bins if we want to have many of them and we don't want to manually specify the breakpoint between each and every one. So there we go, we get a much finer histogram. And we know where the breakpoints are: we know this is going from zero to 1 million, 1 million to 2 million, 2 million to 3 million, and so on. So that's the histogram, and it's telling us the spread of emissions from the facilities. In particular, we see we've got a lot of values that are close to zero and very few values that are really, really large. This is what we call right skewed; it's an example of a right-skewed histogram.
Let's go ahead and calculate some measures of center. So what is the average level of emissions from these facilities? Let's do this in a couple of print statements. If we want to find the mean, also denoted x bar, it's going to be equal to our quantitative variable, GHG, which in this case is total reported direct emissions, and we just say dot mean to calculate the mean. It's good practice to include units. Our units here are millions of metric tons of CO2 equivalent, so MMT CO2e. That'll give us the mean, and let's compare this to the median. Instead of the mean function, we just have median appended on there. And this doesn't have the notation x bar; the median is just the median. Let's run this.
You see that our mean is about 388,000 and our median is 65,000 or so, a lot less. What the mean being much larger than the median indicates to us is that the distribution is indeed right skewed. In other words, the value of the mean gets pulled up by these much larger values up here, which are most likely outliers. So the mean is more susceptible to these outliers, and the median is more robust to them. The median falls well within this first spike here, and the mean is somewhere up here.
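
The comparison of mean and median can be sketched with a handful of made-up numbers, chosen so that one very large value sits far above the rest. These are not the EPA values, and the variable names are only illustrative.

```python
import pandas as pd

# Toy, right-skewed values for illustration only: most are small,
# one is very large.
ghg = pd.Series([0, 20_000, 50_000, 65_000, 80_000, 120_000, 2_500_000])

mean = ghg.mean()      # pulled upward by the single large value
median = ghg.median()  # middle value, unaffected by how large the top value is

print("mean (x bar) =", mean)
print("median =", median)
```

Here the mean works out to 405,000 while the median is 65,000, the same pattern described above: a mean well above the median is a sign of a right-skewed distribution.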