Alright, I was watching a time series forecasting YouTube tutorial, and this video is going to be a little bit of a roast, meaning I'm just going to critique one little thing that I saw. In a spirit of goodwill, I think this guy is a much better YouTuber than I am, so when you copy this, go ahead, click on his video and like it. Let me give a little background. In this notebook there was a review of time series patterns, a view of the data, a train/test split, creation of time series features, and training of XGBoost. Finally there was an RMSE score, a very good RMSE score, and that's where it ends.

So I'm going to start here. The first thing I was curious about was: is there any missing data, or are there any missing timestamps? At first glance, just running this little cell here, there's no null output for this data frame, so that means there can't be any missing data, right? For time series, one way to verify that there really is no missing data is to do a left join on a grid of dates, where you start with the first date in the data frame and go to the last date, and that's what I've got here. The next thing to do is to merge that grid of dates onto your original data. I also clean up the dates a little bit, because there is some extra fluff at the beginning and the end of the timestamps, and then I can look at how many nulls there actually are. Before, there were zero nulls; now I think there are somewhere in the 30s. So a relatively small set of timestamps is missing, but it's still good practice to make sure you're looking for these timestamps and imputing them. Now that we know there are missing values, the next step is imputation. So why impute missing values? First, it's just good practice. Given that this data set is huge, because it's at the hourly level, and the number of missing data points is around the 30s, there's not really going to be an impact on performance.
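The grid-and-left-join check described above can be sketched like this. The data here is synthetic and the column names (`datetime`, `target`) are my assumptions standing in for the dataset in the video; the point is that missing *rows* don't show up as nulls until you join against a complete date grid.

```python
import pandas as pd

# Toy hourly series with two timestamps deliberately dropped
# (a stand-in for the hourly dataset in the video).
idx = pd.date_range("2024-01-01", periods=48, freq="h")
df = pd.DataFrame({"datetime": idx, "target": range(48)})
df = df.drop(index=[5, 17]).reset_index(drop=True)  # simulate gaps

# Checking nulls on the raw frame finds nothing: missing rows
# are absent entirely, not present with NaN targets.
assert df["target"].isnull().sum() == 0

# Build a complete grid from the first to the last timestamp and
# left-join the data onto it; the gaps now surface as NaN targets.
grid = pd.DataFrame(
    {"datetime": pd.date_range(df["datetime"].min(),
                               df["datetime"].max(), freq="h")}
)
full = grid.merge(df, on="datetime", how="left")

print(full["target"].isnull().sum())  # the two dropped hours show up as nulls
```

The same pattern scales to the real dataset: the only inputs are the min and max timestamps and the expected frequency.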
What I mean is that if we do some imputation work here, we're still probably going to get the same RMSE score. There are a couple of cases where NAs become serious. When there's a large fraction of missing values, it can become very difficult to get good forecasts. I also want to make the point that missing values closer to the forecast origin are going to have a larger impact on your forecast.

In the example here I will only impute on the training set, and I'm going to remove all the other features; down the line I can recreate them, so that's not a big deal, and it's possible to bring them back at the end. Since imputation here relies on lag features alone, I could also take the opportunity to add lag features. In fact, I was planning on doing that in this video, and I did make some notebook cells with autoregressive feature creation, but I thought I would leave that up to you guys: go ahead and try to make your own autoregressive features with a method similar to the one I use for imputation. What I'm going to show is just a very naive fill-down method; it's maybe not as robust as some other time series imputation methods, but it's better than nothing.

So I grab the training set, and with that training set I pull an example row to show that it is missing: after doing that left join, this is one of those timestamps that just didn't have a target. In this cell, what's happening is I'm adding an hour to the datetime. It's a little counterintuitive, but when you make lags, you add time and then left join that back onto the data frame; that's what's happening here. I compare the missing timestamp's target with the imputed timestamp's target just to make sure that things went well. I can also run .isnull(), but again, this doesn't mean too much until after you do a left join on your datetimes.
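Here's a minimal sketch of that add-an-hour-and-left-join imputation step, on synthetic data with illustrative column names (not the notebook's actual names). Shifting each datetime forward by one hour means that after the join, every row can see the previous hour's target, and only the missing rows take it.

```python
import pandas as pd

# Hourly series as it looks after the grid left-join: one target is NaN.
grid = pd.date_range("2024-01-01", periods=6, freq="h")
full = pd.DataFrame({"datetime": grid,
                     "target": [10.0, 11.0, None, 13.0, 14.0, 15.0]})

# Build a lag table: add one hour to each datetime, so a left join
# lines each row up with the value from the hour before it.
lag = full[["datetime", "target"]].copy()
lag["datetime"] = lag["datetime"] + pd.Timedelta(hours=1)
lag = lag.rename(columns={"target": "target_lag1h"})

full = full.merge(lag, on="datetime", how="left")

# Naive fill-down: use the lagged value only where the target is missing.
full["target"] = full["target"].fillna(full["target_lag1h"])
print(full[["datetime", "target"]])  # the 02:00 gap now carries 01:00's value
```

One caveat: a single pass like this only fills gaps one hour deep; for runs of consecutive missing hours you'd repeat the pass or fall back to something like `ffill`, which is part of why this counts as a naive method.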
So at this point, all of the target variables in the training set should be imputed, and we can concatenate this back with the test set; then we have a fully imputed, clean data set ready for time series forecasting. From here you can go ahead and recreate some of those same features the YouTuber had before, and you can make your own autoregressive features, but instead of adding an hour for your lag features, maybe you would add a week, a month, or a year, depending on how far out you're forecasting. And that's it. I hope you learned something new, and we'll see you next time.
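As a starting point for the exercise above, here's one way to sketch those longer-horizon autoregressive features on synthetic data; the feature names and the 1-week/1-year horizons are my choices, not the notebook's. The idea is the same as the imputation step, just with a bigger time offset, implemented here by looking up each shifted timestamp in a target dictionary.

```python
import numpy as np
import pandas as pd

# Two years of synthetic hourly data, target indexed by datetime.
idx = pd.date_range("2022-01-01", "2024-01-01", freq="h")
df = pd.DataFrame({"target": np.sin(np.arange(len(idx)) / 24.0)}, index=idx)

# Calendar features of the kind the original notebook built.
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month

# Autoregressive lags: subtract the horizon from each timestamp and map
# the past target back. 364 days (52 weeks) keeps the day-of-week aligned.
# Only use lags at least as old as your forecast horizon to avoid leakage.
target_map = df["target"].to_dict()
df["lag_1w"] = (df.index - pd.Timedelta(days=7)).map(target_map)
df["lag_1y"] = (df.index - pd.Timedelta(days=364)).map(target_map)
```

Rows earlier than each horizon come out as NaN, which XGBoost can handle natively, so you don't necessarily need to drop them.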