 Hi. In this video, I'm going to go over the data set that's going to be used for the next several hypothesis testing videos and used throughout all the different demonstrations of different forms of hypothesis testing that's coming up. So this data set is, comes from ERCOT and contains the power generation from different sources for two days from the Texas power crisis in February of 2021. Those two days are February 14th and 15th. It's sort of at the height of the crisis. On the 14th, ERCOT started shedding load basically in response to winter storms. And so we can see how power generation fluctuates during that time in this data set. The data are contained in 15 minute intervals throughout that time period. So a lot of data here. And aside from this data set, we know from other sources that the forecasted peak capacity during this time for wind was 6.1 gigawatts. And that forecasted peak capacity for natural gas was 48.4 gigawatts. What this forecasted peak capacity means is this is basically ERCOT's prediction of how much power or energy would be needed from these two different sources wind and natural gas during this timeframe to meet demand during this unseasonably cold winter for Texas. So through in the remainder of this video, I want to go through the processing of this data set. So let's get to it. So we'll start by importing the data. I've got to contain the Google Drive. So I've already mounted my Google Drive. And before I import it, it's always good to click on the CSV and open it up either in Google Sheets or Excel, of course, as we've done before. And so here we see the raw data format of this data set. And the dates in the first column, the fuel type, biomass, coal gas, a lot of different sources here, not just wind and natural gas. And here we see two different forms of natural gas. So there's natural gas and then combined cycle natural gas. Settlement type, total energy, or total power I should say. And then, and then the times that throughout the day. So starting basically at midnight and going out throughout the day, in which power was generated. And then in the cells here, we've got the different amounts of power generated. And we know from ERCOT that these values are in megawatt hours. So let's get down to the processing steps. First off, we'll import the data set using using the usual command here. It's also good to look at just a preview of the data we can do gen.head to see what, see what we have here. Look at the first five rows and last five rows. And we see what we saw in the CSV basically. It looks like everything was imported correctly and also it's not showing all the columns here so we've sticking out, you know, some of the middle columns but we can see that it goes up from, you know, 15 minutes after midnight all the way up to midnight of each day. Let's take a look at the column headers here so gen.columns. It's always good to look at the column names. Again, date, fuel, settlement type, and then all the times throughout the day. Okay, all going good. So our first order of business is to filter out some of these rows. So we want to focus on just those two days, every 14 to 15. And additionally, we only want to focus on two power sources, wind and natural gas. So we can filter based upon those two criteria. So let's make a new data frame called gen fill for filter. We'll call it anything here. And our data frame is currently called gen for saving for generation, power generation that is. And then we want to filter this. So we'll have one filter or set of filters for date. We'll have another set of filters for power source. And so in date, we'll have, we have two dates to filter. We'll separate those by an or statement. So it could be February 14 or February 15. And so here we just need to say gen.date equals. So we need to make sure we get the formatting exactly right here and here it's useful to have this preview and that it's given by head here we know where it takes on this format so month day. Year. And we want to edit this to be out of the 14th of 2021. Or the 15th of 2021. I'll do the power source on the next line here. Organize this a little bit better. And so our power sources as we saw in the file could be wind so our fuel could either equal wind with a capital W, or it could be just gas. Or we have that third type of gas dash CC for combined cycle. Okay, so let's make sure that that worked. Sure we go back to preserve our indenting here. Keep it organized. Oh, we have an error here. Series object is not oops I forgot a or statement here. There we go. Okay, and then we have it so now we are fuels only limited to gas gas CC and wind. Okay, that's all well and good. So what we need to do is that we need to rename one of our columns here. Now this is just a particular set for this data set but turns out that what Urquhart calls in the night 0000. This is going to be they allocate this to to say for example, February 14 when really in reality, you know midnight. It belongs to February 15. Right. So, so let's just change this to to 1159 p.m. or 2359 using military time. So that it's attributed to the right day. The reason for this is that the built in date time tools and in Python are going to think that 0000 should belong to the 15 when really that data value that we have there is attributable to the 14th, for example. We'll just make this quick edit gen fill, and we just need to rename, and we'll rename columns. And so our existing name zero colon 00 should instead become 2359. So even though we're, you know, taking it back one minute that it's really not substantial enough to ruin our data set. So there we have it. Oh, this should be encased in squiggly brackets. There we go. So, moving on here, we've, we're done with that. Now we can remove some superfluous columns here so fuel. I'm sorry not fuel but settlement type in total we don't need. So let's go ahead and get rid of those. So we'll continue just writing over gen fill. And so we'll say, we'll use our drop command to drop from gen fill some columns that are equal to, and in squiggly brackets, settlement type, and total. So we don't need those for the rest of the analysis. And we'll just check on this always get to check and make sure that that worked. And there we go we don't have settlement type we don't have total. Now, this next step is a little bit tricky but essentially we need to think about. So these first steps for, you know, really straightforward we can get rid of some extra stuff here and filter out the things we don't need basically. But ultimately we need to think about how we want this data organized. So, and to do that a good, good, good place to start is to think about what are the real cases here. Well, the case, or each individual case is the power generation. And this is power generation from each source on each day at each time. So, for following our standard format of the data frame having one row for each case, we need to have these generation values all in one column. We're going to have a separate column that indicates the fuel type and a separate column that indicates the date and a separate column that indicates the time. Turns out we're going to combine data times that we don't have. So we have this in one one column here but but basically we need to distinguish each value here by those other columns, the fuel, the date, the time. So in order to get it into that right format to reshape what we currently have here into something that has those columns we're going to use this melt function. All right, so how we do this is, again, we'll just write over gen fill or replace this. And we'll call our data frame gen fill dot melt. So we're going to use our melt function here. And with our ID bars. So the, the bars that are identifying the cases. And so in square brackets, this will be the date and the fuel of our name will be time so that'll be unique for each one. And then our values. So our value name is going to be, sorry, the, the time is going to be a new variable that's going to create. That's going to take all of these current column writers the different time and put them into this new variable called time. And then value name. We can call this generation. And we know from beforehand that these values are in meg watt hours, so we'll keep the units there. So it's going to take all the values that are under these time columns and put them into this generation call. So in other words, just to summarize it so these ID bars are the current ID variables, date and fuel. So those are static we're going to keep those all the other columns are going to go into a new variable called time. And the values underneath those are going to go into a new variable called generation defined by value meeting. Okay, so let's take a look and see how this has worked. And there we have it. You've got our new structure so instead of a wide fat table we've got a skinny neuro table. We've got date, we've got the fuel type, we've got time, and we have generation so each of our cases here as a single row. And that's exactly what we want. Now as I said before we can combine date and time so that we have something called date time that's going to be unique for every case here. So we need to first import a package date time. Then we will create a new variable. We'll just call it date time. We could call it anything we wanted but date time is pretty straightforward. And then we'll say from this pandas package. We need to reduce this function to date time so transform something to the date time object. And the thing we're going to transform is gen fill that date column date here. Plus, and here I'm going to put in a blank space, and another plus, and then our time. So here we're not only pulling in the values under date and under time those current columns here, but we're also specifying the format we're going to take those and we're going to separate them by a single space. So let's see what this looks like here. Go ahead and run it. And there we go we've got a new column date time that's got the date space time. Which is handy. Okay, so, and if we wanted to we could remove date and time from this. But that's not not necessary. One other thing we need to do is we need to convert our generation, which is actually units of energy here so megawatt hours and convert this to power. We're going to convert this in gigawatts. The reason being is because we have our forecasted peak capacity in gigawatts here. What we're going to want to do is compare the actual generation to these peak capacity values. So we need to have them in the same units. So let's go ahead and convert megawatt hours to gigawatts. We're going to be here basically using the state of the known formulas or unit conversions that we're going to enable here. So, we'll start by taking these existing generation values. And we can see the first thing we need to do is we need to get rid of these columns, the commas that are in. So, if we took these directly, Python's not going to recognize them as numbers, we need to remove the commas in order for it to recognize it as a numeric variable. In order to do that, we will replace these strings. We're going to replace the commas with nothing. The way this replace function works is we take our variable, we say dot string, this is referencing a string package, and we're going to use this replace function. We're going to take what we want to replace the commas, and we're going to give it a second argument of what we want to replace it with. In this case, nothing. So that'll just simply remove the commas. We can go ahead and make sure that worked. There you have it. So we can convert this to be numeric. I didn't do the, didn't look at d types before, data types before, but if you did that you'd see that this is actually a character column, character variable. So we'll say to numeric, so we can convert that to a number now. But ultimately, we're going to want to do some quantitative operations on this. So you did as a, as a number. All right, so now that it's a number now we can actually do the unit conversion. So we'll start here instead of generation and megawatt hours and we can say the answer you're not going to be gigawatts. And what we'll do is we'll take our existing variable generation megawatt hours, and we will divide by 0.25 because there's a 15 minutes as a quarter of an hour sort of dividing by these hours to take it on the numerator here. And also additionally we're going to divide by 1000 because there's a 1000 megawatts and a gigawatt. So we can make sure that that worked. And there we go. So, we've got a generation and gigawatts here. And then we can now we can clean up a little bit also we can drop some of these unneeded columns now. So we'll drop columns, we can get rid of date, we can get rid of time, and we can also get rid of generation in megawatt hours. This is going to give me some problems because I already ran that but anyway we can. Here we can put these in a chunk of code. And there we go. So now we have our cleaned up data set. The reason I got this error before is because I'd already run these and so now it can't find. I can't do this replacement because I already replaced it. So, anyway, we've got our cleaned up data set here with just the variables we want the fuel type, the date time and generation. One other thing we need to do is we've got gas and gas combined cycle. Let's combine these into just natural gas because these are two different forms of natural gas. We're going to lump them together because ultimately that forecast to be capacity is considering both forms of natural gas and the generation data is one of those two sources. So, in order to do this, we need to first pivot our data frame. So let's make a new object called genpiv. And what we're going to do now is we're going to, we have something that's, that's tall and skinny here and we're going to make it a bit fatter. We're going to have separate columns for a fuel type here. So we're going to essentially do the reverse of that melt so that we can then do some operations on the two different columns. So let's say we'll define an index here indexes date time. So indexes date time because that's a unique identifier for, for these rows basically will combine with the fuel type I should say, and then we're going to make some new columns that are based on fuel. And the values for this new data frame are going to come from our generation in gigawatts. Okay. So let's take a look at what that looks like. So there we go. Now we have a data frame that has indices so this is not a column right now these are indices of date time. And then we have gas gas combined cycle and wind and the values they're all in generation in units with gigawatts. Okay. And now we can combine gas and gas EC simply by adding those two columns together. So I'll make a new column here let's call it natural gas. And we'll take our gas column and our gas CC column. And we're just simply going to add those two values together. So we're going to aggregate them. And now after we do that, the gas and gas EC are going to be extra. So we can get rid of those will get rid of gas and gas CC since those are no longer needed. And let's see what we have now. Here we go. We've got now just two columns wind and natural gas. And we see that gas gas EC seven and a half 26.2 others together 33.7 that's about right. So so there we go. Now we want to get it back into the form that we had before where fuel was a column not two separate columns. Melt again. Before we melt we need to reestablish date time as a column. So what we're going to say is we're going to say from GenPIV take the index and make it a new column called date time. Let's see what that looks like here. So we still have date times index but we've also separately made it as a column here. And then from GenPIV we can say melt again our ID bars in this case are going to be is going to be just date time. And our values are going to be in generation gigawatts. Go ahead and run that and there we go. That's what we want. So we have a column of date time. And as we're facing the times fuel type and then our generation values. Lastly, let's go ahead and visualize this. This is an important step for hypothesis testing and seeing what the data look like this gives us a good idea of how to set up our subsequent hypotheses. So let's make a line plot here. Where we will have a line one line for each energy source. And plot it over time here. So if you plot, we'll take our data frame GenPIV and we'll have a line where X is going to be our date time. And Y is going to be our generation. We need to get this name exactly right so it matches the column there. And also, we want to specify the color in this case as fuel. So let's see what this looks like. There we go. So we've got our different fuel sources natural gas and red and blue plotted over date time here, we can see natural gas is a lot higher than wins. There's a lot more energy coming from natural gas than win just because there's a larger natural gas infrastructure in Texas. So let's have comparatively large wind infrastructure compared to other states. We will. Let's clean this up a little bit. I don't really like how you can't see the date times here very well so let's adjust our theme and make our access text. Let's make our access a little more readable. We'll say element text and change the angle to be, oh, 75 degrees. You can see the effect of that. There we go. So now we can read the dates here a bit better. We're cutting off the time so let's before we do the theme, I like putting the theme at the end. Before we do that let's adjust the, the date labels. And we can control the formatting of this as such. So we can say we want month and day. So we're only going to extract the month and day because they're all the same year it's all 2021. It's also all the same month we could get rid of the month so let's keep the month and day. And, and we also want to include our and separated by colon, the minute will put a plus here. So let's see the effect of that. There we go so now we've got some time information that we have have the hours and also the minutes displayed. Great. Let's also put in the capacities. So we have the the actual production plotted here. Let's see how we stacked up against our, our capacities. And so we can insert those capacities since those are essentially constant values over the whole time frame here. We'll put those in as horizontal lines. So we're going to do an H line, you need a Y intercept argument so basically what's the Y value for that horizontal line. And so for for natural gas. This is going to be 48.4. And our color here can be red and and our line type will be dashed, make it a dash line. So we're going to do this for wind at 6.1 gigawatts. Let's make this blue. So there we go. Now. Yeah, I don't really like how these are different shades of red different shades of blues I'm pretty unique picky here if we wanted to control that a little bit better we can say, we can define our color scale for for the curves. And we'll set some values. Let's switch this up. Let's say, so we'll have natural gas be the blue and wind be the red. Doesn't really matter. There we go. So now the colors match in perfectly in terms of shade, just how I like it. So, but the main takeaway here is that well, we can see natural gas is consistently below its capacity over this two day time frame. Wind is below its capacity sometimes. And sometimes it's above it. So that's interesting. So let's retain that, that thought as we go forward in the subsequent hypothesis testing. Okay.