Hi, in this video we're going to cover summarizing categorical data. In doing this we're going to cover frequency tables, that is, counting the values within each category of a categorical variable, and proportions, that is, distilling that information into a single value. So let's get to it.

We're going to continue with our state energy ranking data set from EIA. To exercise this we're going to have to make a new categorical variable. The only categorical variable we have so far is state, which isn't very interesting because each category contains a single state. So let's make a more interesting one. Our categorical variable is going to ask the question: does the state produce natural gas or not? So we'll make a new variable: df3, and in square brackets and quotes we'll simply ask "produces NG?", NG for natural gas. We'll expect a yes or no response here: yes, it produces; no, it doesn't.

To make this happen we'll borrow a function from NumPy called where. The way where works is that we first give it an argument that returns true or false, so some kind of Boolean statement like we've used before. For example, we can use isnull from pandas and ask: is this df3 natural gas column null or not? It's going to go value by value through the natural gas variable. If the value is null, it returns whatever we have for the second argument, so we'll put a no there: it doesn't produce natural gas. However, if isnull returns false, so the value is not null, then that state produces natural gas, so the third argument is a yes. So that's the way the where function works: you always give it something that returns true or false, a Boolean statement. If it's true, it returns the value in the second argument; if it's false, it returns the value in the third argument. Let's see this in action.
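Here is a minimal sketch of the where step just described. The small data frame is a made-up stand-in for the EIA data, and the column names "Natural Gas" and "Produces NG?" are assumptions for illustration, not necessarily the names used in the video.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the EIA data set; names and values are assumptions.
df3 = pd.DataFrame({
    "State": ["TX", "PA", "HI", "VT"],
    "Natural Gas": [1.0, 2.0, np.nan, np.nan],  # NaN means no ranking, i.e. no production
})

# np.where(condition, value_if_true, value_if_false), applied value by value:
# a null natural gas entry becomes "No", anything else becomes "Yes".
df3["Produces NG?"] = np.where(pd.isnull(df3["Natural Gas"]), "No", "Yes")

print(df3)
```

The condition comes first, then the value for true, then the value for false, which is why "No" is the second argument here.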
We can just take a look at this variable real quick and see what we've gotten. We see we have a bunch of yeses and a bunch of noes. Remember, these last 18 are no because they had null values for natural gas, but the upper ones all produce natural gas.

Let's take a look at the data types. The produces natural gas variable is an object, so this is a text variable, not a categorical one. So let's coerce it to be a categorical variable before we move on. We just need to use the astype function here, astype category. We'll take a look at our dtypes again to make sure that worked. And indeed we've got a categorical variable there, so we can move on.

Now let's make use of a frequency table to see how many states produce natural gas and how many do not. We'll store this table in an object called freq. We get our data frame and our new variable, produces natural gas, enclosed in square brackets. The function we want to use here is value counts, so it's dot value counts. As the name implies, it's just going to count the values in this variable, category by category. So let's print our frequency table and take a look at what we get. And lo and behold, we see that 18 states are a no, they do not produce natural gas, and the remaining 33 are a yes, they do. Recall that's 51 states because we're including DC.

Okay, well that's great. Let's find a proportion from this. Say we want to ask ourselves what proportion of states produce natural gas. You could probably do this in your head, or very easily with a calculator, or even just in the cell here. It would be very easy to say, oh, well it's 33 divided by the sum of 33 and 18, right? And you'll get the right answer. But this isn't really programming. We have to ask ourselves: if we were to update this data set, if we got a new one next year, and we wanted to just reuse this code.
Could we do so without having to go in and edit a lot? If we write it this way then the answer is no. We'd have to go in and manually adjust these values by inspecting the table, and that involves work. So let's write this in code that will automatically calculate the proportion for us, even as the values change.

Let's first find the numerator of the proportion. We're going to reference our frequency table. This frequency table is just stored as a series, so it's just the values 33 and 18 indexed by yes and no. If we want to access the count of the states that do produce natural gas, we just need to say yes: freq, square brackets, yes. That returns our value of 33, as we see here.

Now the denominator. Our denominator needs to be all the states, so we need to get 33 plus 18. There are many ways we could do this. We could say freq of yes plus freq of no. Something a little more general would be to simply make use of the sum function on the frequency table. This is more general and easier to use: suppose we had five categories, we wouldn't necessarily want to type out all five, or if there were, say, 50 categories, we wouldn't want to type them all out. Regardless of the number of categories, if we use sum we'll always get the total of the frequency table, which is the total number of observations that we have: 33 plus 18, so 51 observations, or states.

Let's put this together. Our proportion, prop, will be just our numerator divided by our denominator. And let's print this: we'll print "proportion is equal to" and then prop. So our proportion is 0.647 and a whole bunch of digits. It's good practice to round this off, so let's round this to, say, three significant figures and redo it. There we go. So a proportion of 0.647 of states produce natural gas.
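The frequency table and proportion steps above can be sketched as follows. The data frame here is toy data built to match the counts in the video (33 yes, 18 no), and the column name "Produces NG?" and the "Yes"/"No" labels are assumptions. Note that Python's round with a second argument of 3 rounds to three decimal places, which for this value is the same as three significant figures.

```python
import pandas as pd

# Toy data matching the counts in the video; the column name is an assumption.
df3 = pd.DataFrame({"Produces NG?": ["Yes"] * 33 + ["No"] * 18})

# Coerce the text (object) column to a categorical variable and confirm it.
df3["Produces NG?"] = df3["Produces NG?"].astype("category")
print(df3.dtypes)

# Frequency table: a Series of counts indexed by category.
freq = df3["Produces NG?"].value_counts()
print(freq)

# Proportion of states that produce natural gas, with no hard-coded counts.
numerator = freq["Yes"]     # count for the "Yes" category
denominator = freq.sum()    # total observations, however many categories exist
prop = numerator / denominator
print("Proportion is equal to", round(prop, 3))
```

Because the denominator comes from sum rather than from typed-in counts, the same lines keep working if next year's data has different counts or more categories.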
We could simplify this. Here we did it across a whole bunch of different lines, but we could really get it done in one line. We can just replace prop here with the expression we had for the numerator divided by the sum of the frequency table. And there we go. We've got it done in just one line, and we get the same answer.

Or, if we want to turn this into a percentage instead of a proportion, we just convert it to have units of percent. We'll use the same value but now multiply it by 100. It's also important to include the percent sign afterwards, so we'll do that. I don't like this little space before the percent sign. In the print function we can access the argument sep, which sets whatever we want to separate our values with. The default is a space, so I'm just going to get rid of that. And there we go. That looks a little bit better: 64.7% of states produce natural gas.

So there we have it. We've developed a frequency table, or the count in each category. This is a very simple example with just two categories, but if we have lots of categories, it'll work just the same. And we found a proportion using code, so as the values of the underlying data set change, this code will update accordingly. And we've got the proportion reported in a bunch of different ways here. Okay.
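As a rough sketch of the one-line version and the percentage output, assuming a frequency table like the one in the video: the Series below is a hand-built stand-in for the value counts result, and the "Yes"/"No" labels and the message wording are assumptions.

```python
import pandas as pd

# Stand-in for the frequency table produced by value_counts (an assumption).
freq = pd.Series({"Yes": 33, "No": 18})

# Proportion in a single line: category count over the total of all counts.
prop = freq["Yes"] / freq.sum()

# Same value expressed as a percentage, rounded to one decimal place.
pct = round(prop * 100, 1)

# sep="" removes the default space that print puts between its arguments,
# so the percent sign sits directly after the number.
print(pct, "% of states produce natural gas", sep="")
```

An f-string would be another way to control the spacing; sep is used here because it is the approach the video describes.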