Welcome to part two in the probabilistic forecasting series. Last week we did some plotting; this week we'll see how far we get. I'm going to open up the to-do list. We have quite a few more things to do. As a reminder, we're not doing point forecasts but probabilistic forecasts: for every day we're going to have a distribution of forecasts, and we're going to do that using a library called ngboost. This is a library I've never used before, so I'll be learning a little as we go. There are some things on the list I think we'll keep, some we might remove, maybe a couple we'll add, and I could see us rearranging a few. We'll take it one step at a time — there's no rush, and I'm excited to take it slow in this series. If you have any questions, please put them in the comments. If there's something you'd like more detail on, feel free to add that or message me privately.

There is a change in the data that I made. To get it, go to my website, xvzf2.xyz, go to Walmart's M5 data, and re-download the very top link, m5statesales.csv. What happened is that I did pre-processing on the training set that Kaggle gives, but Kaggle also provides a test set, and I didn't pre-process that. Instead I decided we would clip our own data into train and test, which I think will be easier to work with. So again, this is m5statesales.csv, the one that's 124 kilobytes and has three series: California, Texas, and Wisconsin. You can download it by just clicking on it, though what I normally do is copy the link and use wget. So you should have the old m5statesales.csv; I'll remove that and wget the new file. Just looking at the data by date, it runs from the beginning of 2011 to about the middle of 2016, and the state IDs California, Texas, and Wisconsin are all in there.

All right, let's open up the main.py file and take a look. The first thing that catches my eye is all the warnings we're getting from Flake8. If I put my cursor on line two, at the very bottom of my terminal you'll see Flake8's warning: "'from plotnine import *' used; unable to detect undefined names." And sure enough, as we go down, lots of the functions and methods we import from plotnine aren't recognized by Flake8. Better practice would probably be to replace the star import with explicit names like theme_set and ggplot. I'm not going to do that right now — that's something I might do just before wrapping up the script, and even then only maybe. I'm really not too concerned about that warning. We do have some errors here, though, so I might show you how to ignore them with Flake8, and maybe go ahead and fix some with black. The per-usage warning we're getting from the star import is the F405 error, so we might work on that now. But before we do, let's look at this a little longer. This all looks good. Lines 26 to 29 were a copy and paste from a previous file; the main thing I wanted to grab was the plotting.
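By the way, if we ever did swap the star import for explicit names, it would look something like this — just a sketch; the exact list depends on what the script actually uses:

    from plotnine import ggplot, aes, geom_line, geom_point, theme_set, theme_minimal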
We're going to delete those lines 26 to 29 and make it better. The script they were pulled from worked on a single series; in this video we want to work on multiple series — California, Texas, and Wisconsin. So let's delete that and save.

Now let's work on fixing — well, not fixing — ignoring these Flake8 errors so we don't have them distracting us. There are a couple of ways to do this. Let me go to the Flake8 documentation page. If you haven't installed Flake8, go ahead and pip install it. I'm a huge fan of Flake8; I think it's nice to do some linting. So we're going to configure Flake8. Right now, if I just run flake8, it pumps out all the warnings our Neovim editor was showing us, including that F405 error. In fact, we can pipe the output to ripgrep and have it search for F405, and it highlights all those F405 errors we were having. We want to ignore those. You can read through the documentation yourself; what we're going to do is make a setup.cfg file. So touch a new setup.cfg and open it. If you read that docs page, you'll see the format: it's INI format, so it looks similar to TOML — or maybe the right way to say it is that TOML looks similar to INI, since INI came first. You've got to have a [flake8] header, and then we'll add an ignore line with F405.

We also want to make sure Flake8 knows this is the configuration file to use. Flake8 recognizes a few different INI files: .flake8, setup.cfg, and tox.ini. We chose setup.cfg. To cover our bases, look in Flake8's help output and you'll see the --config flag: "Path to the config file that will be the authoritative config source. This will cause Flake8 to ignore all other configuration files." So we'll run flake8 --config setup.cfg. We can tell it's working because the F405 errors are gone.

Is there anything else here we want to ignore for now? Reading through these: F403 is "'from plotnine import *' used; unable to detect undefined names" — I think I want to ignore that one for now too. "Continuation line over-indented for visual indent" — maybe, maybe. Yeah, let's add E127 and E124 as well. Open setup.cfg in nvim and add E127, E124 — oh, and F403. Run Flake8 again. I think that's good; I want to keep all the rest of these errors. Looking back at main.py: line 2 used to have a warning symbol by it, and that's clear now. But we're still getting some errors — "missing whitespace after comma" — so we're going to fix those with :ALEFix.
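For reference, our setup.cfg at this point looks roughly like this (a sketch; your ignore list may differ):

    [flake8]
    ignore = F403, F405, E124, E127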
Running :ALEFix, it says no fixers have been defined — try :ALEFixSuggest. What this tells me is that I have an old Vim configuration file, so let's hop into my Vim config and make sure I've got an ALE fixer in there. It's in my dotfiles, in init.vim. Go down to the ALE section — sure enough, I've got linters but not a fixer. I can't remember the syntax off the top of my head, so let's do :help ale-fix. There we go: the fixer is configured a lot like the linter. The fixer I like to use is black. Black is another command line utility — you can just run black on a file and it will fix it — but I like Vim to run it so I don't have to go back to the command line. That looks good to me; if it doesn't work, we'll come back and fix it. (I'll drop a sketch of the setting below, at the end of this part.) Yeah, this was supposed to be a video on probabilistic forecasting and it's become a video on linting and fixing, but I think it's good to see some of this.

So now :ALEFix should find black. Perfect — it just fixed our file for us, and the git gutter shows the changes we've made since our last git commit. Everything is working great, and we have a clean 23-line script.

Before I start running this, I want to open up tmux and get a few things going. I should have that shell command for watching files — let's keep that going for now, so the plot file pops up every time we change it. I've got two tmux sessions: one is the Vim terminal, and one is for some of the shell stuff we do — maybe we'll change that a little later. I'll open my Python terminal and start running things, sending lines to the other window and pushing them down.

All right, here is our plot. Some observations: we've got California here, Texas is yellow, Wisconsin is blue. This looks like a pretty standard retail time series, at least compared to the real retail series I've seen. Around the holiday season you have these dips — probably stores closed on Christmas. We could verify that if we wanted to, but that dip has got to be the 25th; maybe I'm wrong, but I'm pretty sure that's what it is. No other big observations, except that the series look seasonal, which is also pretty standard for retail-like time series. With highly seasonal series, seasonal lags end up being pretty competitive benchmarks. So if you're running a complex model, you want to make sure it's at least beating lags. The thing that messes up lags is these holiday drops; other than that, for stretches where things are going as usual, lags can be pretty effective.

No other comments here — in fact, let's go ahead and start. We'll make a day-name feature, make a really simple train and test set, and see what we get with some lags. All right, line seven. I don't like how this looks, so I'm going to make Flake8 angry again and indent this the way I want to indent it. E113 — I'm going to add that to the Flake8 ignore list.
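Here's roughly what we added to the Vim config (a sketch of ALE's g:ale_fixers variable; see :help ale-fix for the exact syntax):

    " tell ALE to run black when :ALEFix is called on Python files
    let g:ale_fixers = {'python': ['black']}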
I just don't want to look at those. All right. Dang, there are a lot of Flake8 issues with this chaining. Oh, I added the wrong error code — I typed E123 when I meant E128. Let me keep the list open so I can see what's going on here. Wrong one... E128, there we go. Wow, I've got a lot of ignores now. I just like to do chaining. Every time I run black, it messes up this visual alignment I've got going. Now that I think about it, I've never configured black to leave this specific style alone; I should, because black fixes everything else — which is great — but then it ruins my aligned assigns. The reason I like this layout is that it makes it easy to copy and paste in Vim: just yy and then put. I'm always making new columns. I love chaining; I think it's so clear to read.

We want to make a day-name column: day_name equals a lambda on the data frame, which is going to look a lot like what we've got above. Day name was wrong at first — there we go. The new column shows Saturday and Sunday at the top here, but all of our day names are in there. Now we have a day_name; that'll make it nice for our lags.

The other thing we wanted was a training set. I'd say let's hold out three weeks and try to forecast those last three weeks. I'm going to make a variable train_max_date: grab df_raw, grab ds — ds is our datetime object; date is still a string — and take the max. (What happened here? I wanted this to be df_raw; let's go ahead and make all of this df_raw.) There we go — that's our max date, and we want three weeks before it. We're going to need timedelta, which means importing from datetime, so we'll do that. We could put the max on a new line; this is how I want to do it: subtract a timedelta of three weeks. That looks good — that'll be the max date in our training set. Now the data frame of training data: we want ds less than or equal to train_max_date.

Since we're going to be making lags, and maybe some moving averages, I'll want to plot that data to see how it looks, so I'm going to append some zeros to stand in — NAs, I guess — for our test set, as placeholders when we plot. I'm just thinking out loud; if this doesn't make sense, maybe it will once we get to plotting. So I'm going to make a test set of just zeros. This isn't really a true test set — we're just adding zeros onto our training set, so maybe df_test is a bad name, but it's the first thing that came to mind. From df_raw, grab the complement of what we got above: up here we're doing less than or equal to, so copy and paste that and change it to greater than, then set all the sales to zero. What's Flake8 saying? Missing whitespace... and — oh yeah, we can fix this — E501, line too long. Let's ignore that one too; it's probably my favorite error to ignore. We have big screens; I don't understand why we're still holding ourselves to an 80-character limit. It just doesn't make sense. I'm very happy ignoring E501.

All right, that looks good. We've got our train set and a test set that is just sales times zero, which should make the plotting look nice. Let's glue these together.
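Roughly, the split-and-pad step looks like this (a sketch; it assumes ds is already parsed as a datetime and the column names match what's on screen):

    from datetime import timedelta

    import pandas as pd

    # hold out the last three weeks
    train_max_date = df_raw['ds'].max() - timedelta(weeks=3)

    df_train = df_raw[df_raw['ds'] <= train_max_date]

    # zero-filled placeholder rows for the held-out period, just for plotting
    df_test_zero = (df_raw[df_raw['ds'] > train_max_date]
                    .assign(sales=lambda d: d['sales'] * 0))

    df = pd.concat([df_train, df_test_zero])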
We're going to take our train set and append df_test_zero. Now, on to making some of the time series features. We're going to do the same thing I did in a previous script, grouping by day of week and by state ID. Let's make horizon and seasonality variables. horizon equals 21, because that's seven days times three weeks. And we'll make a seasonality. Seasonality is the idea that your business has some kind of regular pattern: if you're looking at sales at the monthly level, differencing this January against last January would leave a really small residual. In our case we've got daily data, and I'm assuming Mondays all look similar — if we subtract last Monday's sales from this Monday's sales, we'll get a small residual relative to differencing this Monday against Saturday. So in this data set, assuming a seasonality of seven days makes a lot of sense.

Now the tricky part — this is kind of the meat of the video: the rolling calculations, or grouped transformations. Make sure we've got those variables, and we'll do this piece by piece. df.assign: the first thing is lags. We're going to lag sales, grouped by state_id. The reason we group by state_id is that in our data set California, Wisconsin, and Texas are all stacked together; if we lagged without respecting state_id, the lags would probably run into each other, which we don't want — we don't want Texas lags to pick up on California values. So: group by state_id, grab sales, and transform by shifting by the horizon. This is lagged by 21 days. Let's see if it works the way we think. Great — you'll see this lag_sales column, with NaN at the start, because there's nothing for the lag to pick up on until we're 21 days past the start. We'll do a graphical check once we have a couple more of these: plot it and make sure everything makes sense.

So we have lag_sales, a 21-day shift by the horizon. This could be a really good feature to pass to NGBoost. Let's add some more. I'll put parentheses around this to do a multi-line edit. lag_sales_2: here we could shift by the horizon again, but actually, let's not — let's base everything off of lag_sales and start ignoring sales altogether. The idea is that we have a forecast horizon of 21 days; if we build our features off of lag_sales, which is already a 21-day shift, we don't have to worry about data leakage in the time series. Data leakage is when you use a value that won't be present when you're actually trying to forecast. It sounds like a simple thing to avoid, but I've seen it happen so many times in practice. It's easily missed: maybe you're only lagging by one week and basing your features off a one-week lag, but you really need to forecast two, three, four weeks out. In all your training and development your model looks great — and that's because you're using data that won't be available when you actually have to forecast in production.
One way around this — and maybe there are better ways — is to start off with a big lag, one at least as large as the forecast horizon you're trying to forecast to, and then build all of your features off of that main lag. So lag_sales is now, for our purposes, sales; sales doesn't exist anymore. We don't want data leaking, so we just work off of lag_sales. For lag_sales_2, we shift lag_sales back by another seven days. That looks about right. Let's do it one more time for a third lag. All right, Flake8 is telling me I need to add spaces here — fine.

So that's all the lags. Moving averages are pretty common in time series, same with exponentially weighted moving averages, so let's add a couple of those. Let's just call this one ma1: group by state_id, use lag_sales again. What do we need to change here? The lambda isn't a shift anymore — it's rolling, with the window equal to seasonality, and we take the mean. Besides the mean, the other thing you can do here is the standard deviation; I'm not going to do that right now, but it could be a really useful feature. Make sure there are no errors... I don't like those parentheses down there. Okay. Now the exponentially weighted moving average — we'll call it ewm1, and let it match our moving-average syntax: group by state_id, use lag_sales, transform — not rolling, ewm. "Unexpected keyword 'window'" — oh yeah, ewm doesn't take a window; it takes a span. So span equals seasonality, which means we have a seven-day exponential moving average. I think that looks good. Let's assign all this to a new variable, df_roll.

And I think we're ready to plot. I want to zoom in for this plot — our current plot is pretty large and goes back pretty far, and I just want the most recent months. Let's make a data frame just for plotting: df_plotting is df_roll with just the last couple of months, dates greater than or equal to — since this is compared against a string, I've got to quote it — '2015-12-01'. I want December in there so we can see the mess that holidays make for these lags; if I don't get the holidays in there, it'll look like these lags always work and there's no reason to ever do anything other than a lag. Same theme, same color palette, but the data is now df_plotting. Let's stick with the lines for now, and start with just one series. Not AWS — aes. aes is aesthetic, so we pass the aesthetics: y equals lag_sales, and I'm going to gray this out. I think one year is going to be too long; let's do one month. We still need to keep the geom_line for our main time series. Everything else looks good. Do I want to change the name? Let's point the script at this file now — we're probably going to modify it quite a bit, so might as well set it up here. Run it again just to make sure it's working. Okay, there it is: our lag_sales, shifted 21 days, or three weeks, and the pattern looks pretty good. You'll see the zeros down there. You know what, I do want to keep the points — I'm going to change things a little. There we go; I think that's a better look.
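Pulling the feature engineering together, the block we've built up is roughly this (a sketch; the exact shifts for the second and third lag are my assumption of what's on screen):

    horizon = 21      # forecast horizon: 7 days * 3 weeks
    seasonality = 7   # weekly pattern in daily retail data

    df_roll = df.assign(
        lag_sales=lambda d: d.groupby('state_id')['sales']
                             .transform(lambda x: x.shift(horizon)),
        # everything below builds off lag_sales, so nothing leaks inside the horizon
        lag_sales_2=lambda d: d.groupby('state_id')['lag_sales']
                               .transform(lambda x: x.shift(seasonality)),
        lag_sales_3=lambda d: d.groupby('state_id')['lag_sales']
                               .transform(lambda x: x.shift(2 * seasonality)),
        ma1=lambda d: d.groupby('state_id')['lag_sales']
                       .transform(lambda x: x.rolling(window=seasonality).mean()),
        ewm1=lambda d: d.groupby('state_id')['lag_sales']
                        .transform(lambda x: x.ewm(span=seasonality).mean()),
    )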
Back to the plot: yeah, that looks pretty good. We've got some lags in here, so you can see how the other lags do at forecasting — though maybe forecasting is the wrong word; what we're really doing here is time series feature engineering. That's the purpose. So that was lag_sales. Let's copy it and add lag_sales_2 and lag_sales_3. Yeah, look at that: the lags do a pretty good job, I'd say, of capturing this time series. This is California; Texas is all right; probably not so good for Wisconsin. You'll also notice the different peaks down here. lag_sales is a 21-day shift, so that's this first peak; lag_sales_2 is seven days after, so that's the second peak; and lag_sales_3 is the third. That's what explains these dips here.

All right, that looks good. Now let's see what the moving average and exponentially weighted moving average look like — we'll add ma1 and ewm1. You'll see these new lines that almost look like trend lines in here. With moving averages, the longer you look back, the more of a trend line they become — like a monthly moving average, where you're averaging over a month. Actually, let's change our moving average to group by day as well. Instead of a moving average looking back over contiguous days — Saturday, Friday, Thursday — let's take this Saturday, the previous Saturday, the Saturday before that, and average those. Going back up to our moving average calculation, this is really easy: all we need to do is change the groupby from just state to state and day (there's a short sketch of this below).

Okay, that's looking pretty good. It's amazing, at least to me, that lags, week-over-week moving averages, and exponentially weighted moving averages capture this kind of pattern. So when we do benchmarks, we want to make sure they're competitive — simple methods that seem to be pretty decent forecasters. They don't work in all cases: you'll see they work best for California, do okay on Texas, and Wisconsin is a little more of a mess, but still not too bad. We still have a pretty tight band here; I don't think anyone would say these forecasts are not sensible. So I think these will be good features to include in our model.

All right, let's look at our to-do list. Group stats — that's kind of what I meant by these lags, so we'll say: group stats, lags, MA, exponentially weighted moving average. Variations, covariations — I don't really want to do ACF right now; maybe that'll make it later in the series, maybe not at all, but we're going to skip it for now. I guess these time series predictors are kind of like these group stats... no, I'm going back and forth here. When I'm thinking time series predictors, I'm thinking more of things external to the time series: holidays, promos. We'll probably not use promos, but we could pull in holidays at a later time. So let's mark this off; predictors we can add for later. Probabilistic forecasting by group — I think it's about time for that. But you know what, this has been kind of a long video. Maybe we call it good here... I'm trying to think. You know what? Let's just keep going. This is going to be a longer video.
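Here's that tweak, written as a standalone transform (a sketch; same window, just a finer grouping):

    # average over the same weekday: this Saturday, last Saturday, and so on
    df_roll['ma1'] = (
        df_roll
        .groupby(['state_id', 'day_name'])['lag_sales']
        .transform(lambda x: x.rolling(window=seasonality).mean())
    )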
So we want to use ngboost, and I have not used ngboost before, so we're going to look at the documentation. Installation — I installed ngboost before this video so it wouldn't take up time here. Usage — let's just make sure everything is functioning. I'm going to take the usage example and run it. All right, it looks like ngboost is working fine, and we're getting an MSE and a negative log likelihood.

Now, we aren't going to use the Boston data set or those train/test splits — we're using our own really simple split — and we're not going to use mean squared error either. So we'll use pretty much none of that. I'll take the imports, move them to the top, and drop the rest.

I'm just thinking about how to split up our train and test sets. I think we've got everything we need. Yeah, we do. Let's take a look at df_roll — the most recent data we have, the data frame that wasn't just for plotting. We've got some NAs in there, so let's drop them. What is it... dropna? Yep, dropna. We'll call this df_prep_boost. And the training set: X_train is the first part of df_prep_boost — make sure we're grabbing the dates that are less than or equal to that train_max_date variable we made above. We don't want sales in there. You know, maybe a better way to do this is to just say which columns we do want. Yeah, let's do that. That'll be a good X_train. Now let's get Y_train, and X_test. I think that should do it.

All right: NGBRegressor. Let's take a look at the docs for this. We don't want point forecasts, so we'll probably not even use MSE. Here's what we want: the distribution options. We have Normal, LogNormal, and Exponential. I say we pick Exponential for now and see how that goes. The exponential should be a one-parameter family; I'm not sure how the location/scale handling works here — I'd have to read a little more — but I think it should work in this case, since we don't have negative values in our data set. We'll import the Exponential distribution and give it a shot, tossing that Dist into our NGBRegressor. Just for clarity, let's call the model ngb_exp, for exponential. What else do we have here? verbose — let's set verbose equals True. What other keywords are there? Score: we want a distributional metric. LogScore — no, I don't want LogScore. CRPScore — that's what we want. All right, I think we're getting pretty close. Yeah, I think this is good.

So we should have our data here. Oh — our test targets are all zeros. We don't want that; let's get Y_test from df_raw, not from what we appended. There we go. Just as a reminder, I did something weird earlier: I'm not even sure I needed to, but for plotting I appended zeros so we could see how the lags and moving averages turned out as a forecast, and df_roll was built off that padded data set. So for the test targets I don't want that data set — I want to go all the way back up to df_raw and get Y_test from there. All right, I think this is good — let's give it a shot. "Exponential not defined" — okay, I didn't import it. "CRPScore not defined" — got to import CRPScore from the scores module. Okay. All right, we got something working here.
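Put together, the model setup is roughly this (a sketch; the feature and target variables are the ones built above):

    from ngboost import NGBRegressor
    from ngboost.distns import Exponential
    from ngboost.scores import CRPScore

    # exponential output distribution, CRPS as the distributional training metric
    ngb_exp = NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=True)
    ngb_exp.fit(X_train, Y_train)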
Now, what happens if we call predict? It looks like it's giving point predictions — we can test that. What's the length? 63 — that's three states times 21 days. Yeah, it's giving us point predictions, so pred_dist must be something else. Let's see what pred_dist gives us: an object. We'll have to figure out how to get our samples from that object, if that's even how this package works. All right — it looks like pred_dist gives us the parameters of our distribution. Let's take a look at that. So we don't want predict — well, it could actually be useful; let's just comment it out. We might want to glance at it at some point, but the whole point of this video is probabilistic forecasting.

So let's grab the very first item and get its params — scale. That makes sense, because the exponential distribution is just a scale. So let's look at scipy.stats' exponential, pass that parameter in, and see what we get. Import it, and maybe we'll generate some random variables with that scale — expon with rvs. rvs popped up over there; I want to pass in the scale, and I can't remember how to do that right now... here we go. There are a couple of interesting options here that might be alternatives, but for now let's just do a random sample. I believe we can pass the scale straight in — oh, params is a dictionary, so let's do .get('scale'). It's a little ugly having that inline; let's move it up. So we're just doing a random draw here: the scale parameter of this exponential is 14,803, and we'll take a couple of random draws. That seems to check out — it seems like a plausible amount. So let's do a real draw of 5,000 and call it draw.

Now we should be able to do things like take quantiles of the draw. That's not how you do it... right, it's np.quantile, and we need to import numpy. Let's get the 50th quantile. Hmm, that seems large for the median. What was the scale — 14,803? And the actual value — 19,117? The median here is about 10,000, and the 90th percentile is 33,000, so the distribution at least covers the actual. Interesting. Well, maybe this isn't the worst. I'm not sure — we'll have to do a deeper analysis, but this seems promising. I need to learn a little more about NGBRegressor and pulling out some of these values. What I'm thinking of doing is passing every single one of these scales into a draw of exponential random variables, and then looking at the CRPS of this exponential NGBRegressor against a benchmark — something simple. So that's where we're at right now, and I think we'll go ahead and end here.

Let's take a look at the to-do list really quick before we end. We did some probabilistic forecasting "by group" — though this is really more cross-learning, so it's not by group; we'll just call it probabilistic forecasting. "By group" would fit better if we were doing univariate time series. I'll leave a cleaned-up sketch of that last sampling step below. Thanks for watching.
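For reference, here's roughly what that sampling step looked like (a sketch; it assumes the fitted ngb_exp model from above, and that indexing the pred_dist object works as it did on screen):

    import numpy as np
    from scipy import stats

    pred_dist = ngb_exp.pred_dist(X_test)

    # the exponential is a one-parameter family: params is a dict with just a scale
    scale = pred_dist[0].params.get('scale')

    # sample the predictive distribution for the first test row
    draw = stats.expon.rvs(scale=scale, size=5000)

    print(np.quantile(draw, 0.5))   # median
    print(np.quantile(draw, 0.9))   # 90th percentile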