Hi. In the next few videos, we're going to cover some data processing and data manipulation code in Python, starting in this video with data types and some other useful procedures. So let's get to it.

To exemplify these procedures, we're going to work with data from the Energy Information Administration. In particular, we're going to look at state rankings. So we'll go to their website, which is linked here. The US Energy Information Administration publishes a wealth of data online. What we're going to take a look at is under this production section here. In particular, we're going to look at two data sets, two CSV files: one for total energy, one for natural gas. These are quantities of energy and natural gas produced by state, and the states are also ranked. I encourage you to go on and explore other data sets published by the Energy Information Administration, but for now we're just going to focus on these two for simplicity's sake.

It's also useful to preview the data in spreadsheet format before we start playing around with it in Python. So let's start with the natural gas data set. Here's natural gas production, ranked by state, from one down through 33, 34, starting with Texas on down. Here are the quantities of natural gas. We also have this blank column here: note, rankings are blah, blah, blah. Interesting. Turning to total energy production, it's also ranked by state, going down to 51. So here we have all the states, including DC. And we also have this curious blank note column here.

Okay, so let's dive into the Python code. I've already mounted the Google Drive and imported the data sets, following procedures that we've seen in the previous lesson. I've also already run these. We've got our natural gas data set stored as NGDF, DF for data frame, and we've got total energy stored as TEDF. Now, if we just want to see what columns we have, a useful tool here is the .columns attribute.
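As a rough sketch of what that cell might look like, assuming the data frames are pandas DataFrames: the lowercase name ngdf, the column labels, and the values below are all placeholders for illustration, not the real EIA file contents.

```python
import pandas as pd

# Hypothetical stand-in for the imported natural gas CSV.
# Column labels and numbers are made up for illustration only.
ngdf = pd.DataFrame({
    "Rank": [1, 2, 3],
    "State": ["Texas", "State B", "State C"],
    "Natural Gas Marketed Production, million cu ft": ["300", "200", "--"],
    "Note": [float("nan")] * 3,  # the blank note column comes through as NaN
})

# .columns is an attribute (no parentheses); it shows only the column labels,
# not the rows of the data frame.
print(ngdf.columns)
```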
When we run this, we just get the column names. This is as opposed to, for example, going back up here, saying NGDF, and getting the entire data frame just to scroll up and see what columns we have. So this is a more concise way to do it. We can also do this for the total energy data frame and see what columns we have there. Both have rank and both have state. The natural gas one has natural gas marketed production in units of million cubic feet. Total energy has total energy production in trillions of BTU. And then both also have this note column, which is not an actual variable.

Furthermore, if we wanted to store these, we could, for example, make a new object called NG columns, which would be equal to NGDF.columns. And I'm going to add something else here: .tolist(). This converts all these names into an object called a list. This could be useful if I wanted to store these names and reference them later on.

Okay, another useful tool is removing columns. For example, we have this weird note column that isn't actually a variable. It doesn't contain any data. Let's get rid of that. Starting with the natural gas data frame, NGDF, we'll say NGDF.drop. Drop is the function that gets rid of whatever column or columns we specify here. I'm just going to copy and paste the column name from up here to make sure we drop the right one. And at the end of this, I'm going to run NGDF.columns again, just to verify that we've gotten rid of that column. And indeed, we no longer have this note column. Let's just repeat this real quick for the total energy data frame. This has a slightly differently named note column, so we'll copy and paste that and run it as well. And we see that it has been removed.

Yet another useful function is rename, for renaming columns. This is particularly useful if we've got some overly verbose column names, for example, total energy production, trillion BTU.
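The two steps just described, storing the column names as a list and dropping the note column, could be sketched like this. The data frame here is a made-up stand-in, the label "Note" is an assumed name for the blank column, and assigning the result of drop back to ngdf is one common way to keep the change, not necessarily what the notebook in the video does.

```python
import pandas as pd

# Hypothetical stand-in data; labels and values are placeholders.
ngdf = pd.DataFrame({
    "Rank": [1, 2, 3],
    "State": ["Texas", "State B", "State C"],
    "Natural Gas Marketed Production, million cu ft": ["300", "200", "--"],
    "Note": [float("nan")] * 3,
})

# Store the column labels as a plain Python list for later reference.
ng_columns = ngdf.columns.tolist()

# drop returns a new data frame without the named column(s);
# assigning it back overwrites the old one.
ngdf = ngdf.drop(columns=["Note"])

# Verify that the note column is gone.
print(ngdf.columns)
```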
I don't want to have to type a long name like that over and over again. Let's call that, as well as this natural gas marketed production, something else. The syntax for the rename function is stated generically above here, but let's put it into use. We'll again alter our NGDF data frame. We're going to perform some function on the right-hand side and store the result under the name we've previously been calling it, so we overwrite it. We'll say NGDF.rename; rename is the function we use here. And again, we're going to specify some columns, but now we need to put in a dictionary, so curly brackets. We give the existing name first. We want to change this name right here, this natural gas marketed production, so I'll put that in there. Then we put in a colon, and after the colon we put what we want to change it to. Let's call it natural gas. And again, we can use .columns to verify that the change has happened. Indeed it has. Let's go ahead and repeat this for our total energy data frame. We just change these object names, and instead of natural gas, it's the total energy column that we want to rename. Stick that in there, call it total energy, and run this. And again, we verify that the change has happened.

Okay, so that covers some handy tools to manipulate the data and get it into something more manageable. Let's turn our attention to data types now. Oftentimes, after we import data, we need to convert some of the data types, because in the import process Python hasn't automatically recognized some variables as what they ought to be. So let's start by finding what data types we currently have, starting with our NG data frame. The key attribute here is dtypes: d standing for data, types standing for types. Go ahead and run this. Here we go. The rank variable is int64, int standing for integer. The state variable is an object. Object is the type that text is usually stored as. Natural gas is also an object, so also text. Let's do the same for our total energy.
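Here is a minimal sketch of the rename and dtypes steps, again on made-up stand-in data. The long column label and the new name "Natural Gas" are assumptions based on how the columns are described in the video, not the exact strings in the file.

```python
import pandas as pd

# Hypothetical stand-in data; the dashes mimic the missing-value marker
# that forces the production column to be read as text.
ngdf = pd.DataFrame({
    "Rank": [1, 2, 3],
    "State": ["Texas", "State B", "State C"],
    "Natural Gas Marketed Production, million cu ft": ["300", "200", "--"],
})

# rename takes a dictionary: {existing name: new name}.
ngdf = ngdf.rename(
    columns={"Natural Gas Marketed Production, million cu ft": "Natural Gas"}
)

# Verify the new label, then inspect the type of each column.
print(ngdf.columns)

# Rank is an integer type; the text columns show up as object in the video
# (newer pandas versions may label text columns with a string dtype instead).
print(ngdf.dtypes)
```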
For total energy, again, rank is an integer. That's what we expect. State is an object, aka text. That's what we expect. Total energy is also an integer. That's interesting, because in the natural gas data set, natural gas is an object. And really, we don't want that to be an object. That should be a number, right? It shouldn't be text. We would prefer that it be stored as a quantitative variable. So how do we make that happen? How do we convert this to a quantitative variable?

Well, the first thing we need to recognize is that this natural gas data set, if we go up and inspect it, has a number of dashes in the natural gas column. This poses a problem: if we try to convert that column to integers or floats, we'll get an error. So we need to replace those dashes first. What we're going to do is replace them with something called NaN, which stands for not a number, meaning no value. And we've seen this above. If we scroll back up, we see that NaN values have been inserted into this strange note column. This is very important. These values are basically placeholders saying that the cell was blank. Instead of just leaving it blank, there's actually something there to make it more obvious that it's blank. What's really important is that zeros were not automatically stored there. You need to bear in mind that zero is an actual value. If zeros were here, that would imply that we knew there was a zero quantity there. But in fact, these are blank, and we need to mark that with NaN.

So let's go back down and substitute these dashes with NaN. A handy function for doing this is the replace function. We're going to replace whatever the first argument is here. So dash dash, that's the thing we want to replace. And we want to replace it with NaN. Now, we can't just write NaN like that. NaN is special, and to make this special quantity appear appropriately, we need to use the numpy package.
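The replacement might look something like the sketch below. It uses made-up stand-in data and calls replace on the whole data frame, which is an assumption; the notebook in the video may apply it to a single column instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: "--" marks a missing production figure.
ngdf = pd.DataFrame({
    "State": ["Texas", "State B", "State C"],
    "Natural Gas": ["300", "200", "--"],
})

# First argument: the value to look for. Second: what to put in its place.
# np.nan is NumPy's missing-value marker (not a zero, not the text "NaN").
ngdf = ngdf.replace("--", np.nan)

# Reprint to verify that the dashes now show as NaN.
print(ngdf)
```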
So we'll import that as np, and we need to call np.nan. This clearly tells Python that we've got these special NaN values here. Let's take a look and verify that this worked. Reprinting the data frame and scrolling down, indeed, where we had those dashes, we now have NaN.

So now we can move on to converting data types. We'll take our NG data frame. Here we just want to convert the natural gas column, so we reference just this column. We do that by taking our data frame, putting square brackets, and then, within quotes, the variable name. And we can say .astype with float64. That will convert it to a quantity. We can check and make sure that worked: we can say ngdf.dtypes. And indeed, natural gas is now a float. That's great.

Another thing we can do is convert, for example, state to a categorical variable. So ngdf state. We'll use the same syntax here, astype, except instead of float, we'll say category. And we'll repeat this for the total energy data frame as well. We can verify that worked with ngdf.dtypes. We see that state is now a category. And for total energy, you'll see the same thing, category.

Okay. So that's a bit of data processing, and investigating and changing data types. Thank you.
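To round things off, here is a minimal sketch of the two conversions on made-up stand-in data, with the dashes already replaced by NaN. Assigning the converted column back into the data frame is an assumption about how the change is kept; the column labels and values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, after the dashes have been replaced with NaN.
ngdf = pd.DataFrame({
    "Rank": [1, 2, 3],
    "State": ["Texas", "State B", "State C"],
    "Natural Gas": ["300", "200", np.nan],
})

# Select one column with square brackets, convert it, and store it back.
ngdf["Natural Gas"] = ngdf["Natural Gas"].astype("float64")

# Same syntax, different target type: treat State as a categorical variable.
ngdf["State"] = ngdf["State"].astype("category")

# Verify the new types.
print(ngdf.dtypes)
```

Float is used rather than integer here because NaN is itself a floating-point value, so a column containing NaN cannot be stored as a plain int64 column.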