Hello, everyone, I hope you're doing super well today. So the other day I was having some Indian food, and I thought: why not analyze data from an Indian restaurant? If you click on this link, it takes you to a Kaggle dataset of Indian takeaway food. I don't know, I was just hungry. Okay.

So let's just look at this data for a second. This is the set of all online orders that were placed at this restaurant. You have an order number, and you'll notice it appears in multiples: there are multiple rows with the same order number, where each row represents one item that was ordered as part of that order. So if you take the first six rows together, you'll see they represent one order consisting of six unique items, along with their corresponding prices; and since the order has six items, the total number of products is six. That's it.

Now, given this data, let's say that I own this restaurant. What I think would be really useful is this: if I know the number of orders coming in on a specific day, I can hire adequate staffing. Since these are online orders, maybe I want to deliver them personally instead of using services like DoorDash. For that I could staff appropriately: if we're going to have a lot of volume coming in two days from now, I want to be able to staff for two days from now and get the appropriate drivers, so that we can deliver these orders adequately instead of being inundated and surprised on the day. That's the entire gist of our objective here: we want to forecast our order volume for staffing, in this case for hiring interim drivers, and we can do this on the fly, daily, just to keep things simple.

The way I want to approach this as a time series forecasting problem harkens back to a video I made a while ago in January on time series forecasting with machine learning, where I broke down this chart. There are two approaches you can take: one is using traditional univariate time series models, like ARIMA and the typical time series forecasting toolkit; the other is using machine learning models and treating time series as a regression problem. When I mentioned this, I know there were a lot of confused faces: okay, what's going on? How do you treat it as a regression? How do you frame your data? I explained it in that video, but this time let's do that same thing in code. We're going to do some order forecasting, but in a way that's suitable for regression. Okay, so, cool.
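To make that regression framing concrete before we jump in, here's a toy sketch (entirely made-up numbers, not the restaurant data) of how a plain series of counts gets reshaped into feature/label pairs that a regressor can train on:

```python
import pandas as pd

# A made-up daily order series, just to illustrate the reshaping
series = pd.Series([12, 15, 14, 18, 20, 17, 21], name="orders")

# Each row pairs past values (features) with the current value (label),
# which is exactly the tabular shape a regression model expects
frame = pd.DataFrame({
    "lag_1": series.shift(1),   # orders one step back
    "lag_2": series.shift(2),   # orders two steps back
    "label": series,            # the value we want to predict
}).dropna()
print(frame)
```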
Let's now actually jump into the code. First of all, I'm installing pandasql here, mostly because in the real world, when you're actually working, you'd be using SQL to query a database as opposed to just manipulating data frames and CSV files. At the moment I don't have connectivity to a database, so I use pandasql to mock that experience, so the code is as close as possible to what you could be writing if you were working for this Indian restaurant and building an order forecast. Cool.

All right, so I'm importing a bunch of libraries now: pandas for data manipulation; seaborn, good for those pretty plots we'll see later, same with matplotlib; and from sklearn I'm importing train_test_split to create our train and test sets. Then we have an XGBoost regressor, which I'll be using as the main model for this time series forecasting. And on this line I'm importing pandasql, which provides a function called sqldf. I'm creating a little function of our own called pysqldf, a one-liner that takes in q (which will be a query) and executes sqldf along with the set of global variables. So instead of calling sqldf and passing the query and the globals every single time, all I need to do is call pysqldf with the query and we're good to go. Cool.

All right, so now I'm reading the same CSV I showed you in the Kaggle dataset, and I'm renaming the columns, since the originals had spaces that aren't very friendly to work with. Looking at two of these orders, this is just how they look in general. The problem with this timestamp is that it's in a strange string format that isn't quite a proper timestamp, because it's missing a seconds field. But we can finagle our way into extracting just the date part and converting that into an actual datetime variable, which I've added as an extra column, date. These two columns should represent the same thing, and they do. And the total number of orders in this dataset is the number of unique order numbers in this column, which is 13,397, and that agrees with what we saw on the Kaggle page. Cool.

Now I'm loading a query. I created a little helper function that reads a file and returns its contents. This is pretty useful when you're dealing with SQL queries, because many a time I like keeping my queries, especially the huge ones, in separate files and reading from those files, so that my notebook looks nice and clean. So first, I'm literally reading the query from queries/daily_orders.sql, a SQL file that I wrote, and printing two samples of what it returns: a date, and the number of orders that occurred on that date. If you're curious what the query really is, it's a really simple one. Let's go to daily_orders. So yeah, here it is.
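Reconstructing everything up to this point, including what queries/daily_orders.sql boils down to, it looks roughly like this. The CSV file name, the renamed column names, the date parsing, and the plot defaults we'll rely on later are my assumptions, not verbatim from the video:

```python
import logging
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandasql import sqldf

# Silence matplotlib's chatty INFO-level log messages (explained below)
logging.getLogger("matplotlib").setLevel(logging.CRITICAL)
# Wide, high-resolution default figures (17x3 inches at 300 dpi, used later)
plt.rcParams["figure.figsize"] = (17, 3)
plt.rcParams["figure.dpi"] = 300

# One-line wrapper so we don't pass globals() on every call
pysqldf = lambda q: sqldf(q, globals())

def read_file(path):
    """Load a SQL query (or any text file) so the notebook stays clean."""
    with open(path) as f:
        return f.read()

# Load the Kaggle CSV and rename the space-laden columns (names assumed)
orders = pd.read_csv("restaurant-1-orders.csv")
orders.columns = ["order_number", "order_timestamp", "item_name",
                  "quantity", "product_price", "total_products"]

# The timestamp string has no seconds field; keep just the date part
# (dayfirst=True is an assumption about the source format)
orders["date"] = pd.to_datetime(
    orders["order_timestamp"].str.split(" ").str[0], dayfirst=True)

# queries/daily_orders.sql boils down to something like this:
daily_orders_query = """
SELECT
    DATE(date) AS date,
    COUNT(DISTINCT order_number) AS num_orders
FROM orders
GROUP BY 1
ORDER BY 1
"""
daily_orders = pysqldf(daily_orders_query)
```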
So first of all, I'm basically saying: just get me the distinct order numbers and dates, because that's all the data I'm interested in. I'm converting date into an actual date, because technically it's a timestamp and we don't want that. Then I'm doing a simple GROUP BY, where we group by the date and get the number of corresponding orders. So for every date, we have the number of orders that occurred. That's it. Let's go back.

All right, so that's what gives us the resulting data frame: for every single day, we have the orders. Right below that, I'm importing the logger and setting the level to CRITICAL, because matplotlib, at least the version I have installed, kept throwing INFO messages. They weren't even warnings; some of the developers just logged info messages, and I didn't want those, so I got rid of them for now. You can comment this out if you want to see what that information was.

Anyway, I'm plotting this chart now. The x-axis is the actual dates, the y-axis is the number of orders, so I'm basically plotting the data frame I just printed. You'll see that I used tail(50), which means I'm only grabbing the last 50 rows, because the data frame goes back pretty far; you can see in the sample that it goes back even to 2016. So I just wanted the last 50 samples.

But looking at this data, I see that the average is around 15 orders a day, and honestly, as a data scientist, 15 is itself within a margin of error. That's a little too noisy for us to even bother forecasting. So instead of asking how many drivers we need to staff tomorrow, the day after, and every single day, I think we can transform the problem into a weekly aggregation: I have all the orders up until today to look at, and I want my model to determine how many orders we'll get in the upcoming week. For the sake of simplicity, let's say I only make the predictions every Monday morning. So every Monday I go to work, run the model for inference, and see how many orders we'll get this week. Once we have that number, I can staff the appropriate number of drivers, because that's all the heads-up they need. For now, let's just assume that's how this company works.

So when you look at this as a data scientist, you sometimes need to revisit the problem, and there's definitely a chance you'll need to restructure it. Hence all the exploratory data analysis is very useful: it gives context to your problem, and it informs business decisions too. You see how everything ties together. Yay. I hope you do, at least. So yeah, I put a little note here: too few orders to bother forecasting; let's look at weekly order volume instead.
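The weekly query turns out to look roughly like this; it's my reconstruction of the SQLite date arithmetic that's walked through next, not the file verbatim:

```python
weekly_orders_query = """
SELECT
    -- snap every date to the Monday of its week:
    -- add a day, jump to the next Monday, then step back seven days
    DATE(date, '+1 day', 'weekday 1', '-7 days') AS week,
    COUNT(DISTINCT order_number) AS num_orders
FROM orders
GROUP BY 1
ORDER BY 1
"""
weekly_orders = pysqldf(weekly_orders_query)
```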
So what we do is run a query for getting weekly orders, which gives a data frame that looks almost exactly the same. If you're interested in the code, let's look at it real quick. Okay, so again, we're getting the number of distinct orders, and I'm casting date into an actual date. And then there's this line, which took me a really long time. pandasql uses a flavor of SQL called SQLite, and I'm not a big fan of SQLite; it's just another flavor, like MySQL, Postgres, Snowflake, all of those. What this line does, essentially, is find the previous Monday for every single date. It says: take the date, add one day to it, jump to the corresponding next Monday, and then subtract seven days to get the previous Monday. That way, every single date you see will be a Monday.

How you interpret this in the output is: for the week of August 13th, 2018, meaning from August 13th through that next Sunday, August 19th, we had 94 orders come in. That's what each row represents: the weekly number of orders, where each week is keyed by the date of its Monday. I hope that's clear. "It is clear, Mr. Halthor. Thank you." Okay.

Next, I'm doing the same thing I did for the daily case: plotting it out just to see what these numbers look like. Now they hover more around 80 to 100 orders per week, which is acceptable; that seems reasonable. Obviously, for a restaurant that's still pretty low, but for the sake of this problem we'll roll with it.

One thing I'm noticing already, though: you can see here, from around September 2nd to September 30th, there's a huge gap with nothing in it. For a period of two weeks, we had no orders. Same thing in the month of October, for a week or two, no orders; same thing in November too. It makes me question a couple of things about this dataset in general, but this is our data, this is what we have to work with, and I'm not going to question it right now. Let's just do it. Okay.

All right, now I'm creating this query called base.sql, which is going to create our dataset. We're treating this as a regression problem, and for regression we have inputs and outputs: the inputs will be the features, the outputs will be the label. So look at this first column right here. This is the same kind of column we saw before, where we had 94 orders in the week of August 13th, 2018. Now, we want to predict that 94 on August 13th. I wake up, I go to work, August 13th, 2018, that's today; I run my model, and 94 is the number I should be getting, because it happens in the future: over the next week, we get 94 orders. So what are the inputs? They're my features, and in this case, all we have is past orders. Past orders can inform future orders.
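Concretely, each row of the dataset we're about to build pairs backward-looking counts with the coming week's total. Something like this, where the 94 is the example from above and the feature values are hypothetical:

```python
# Standing on Monday 2018-08-13: features look backward, the label looks forward.
# (Feature values are made up for illustration; the label of 94 is from the
# walkthrough above.)
example_row = {
    "week": "2018-08-13",          # the Monday we run inference on
    "orders_last_7_days": 88,      # orders from roughly 2018-08-06 up to today
    "orders_last_30_days": 360,    # orders from roughly 2018-07-14 up to today
    "label": 94,                   # orders from 2018-08-13 through Sunday 2018-08-19
}
```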
So one feature the model can use is the number of past orders: how many orders did we get in the last seven days, that is, from August 6th up until today, August 13th? I want to use that to inform the prediction, and you can see the two are actually pretty close; it will be a good feature in general. But what I found also really useful is adding another feature, a 30-day count: how many orders did we get in the last month? Together, these two help capture a trend. If the 7-day count is relatively high, with a smaller gap between it and the 30-day count, it means our sales are actually increasing; whereas in the opposite case, where the 7-day count is very low compared to the 30-day count, it means sales are decreasing, and that helps inform the label.

You can play around with this. I'm only using two features here, but typically I'd add others, like weekly seasonality, where you take weekly variables and inject them into the model, or anything else in your data that might help inform the prediction. Unfortunately, this is a Kaggle dataset with a lot of gaps in it, so this is all we've got to work with. It is what it is.

My hunch coming into this, because there are so many gaps, is that our model is not going to be that great at making these predictions, no matter what model you throw at it, because the data is like this. Even in the daily case, I didn't see any daily seasonality that really stands out, like Mondays having more sales or anything like that. And there are so many gaps in between that weekly seasonality is affected too. So I don't see any seasonal trends being captured at a uniform rate: it might look like there are bumps up and down everywhere, but they're not uniform enough for us to eyeball or capture. Because of that, a lot of this data is just going to be difficult to predict. But that's okay; this is the data we have at my company, so let's roll with it.

So now I'm actually querying for these two features and the label, and constructing the full data frame. These are just three samples from the base query; I showed you the output first because the query can get a little involved, but we can walk through it, it's totally fine. It's a pretty sizeable query, so I'm doing something you've seen before: I encapsulate the logical chunks of the query into CTEs, common table expressions, which are extremely useful to learn and understand, so I highly recommend you do. The first CTE just grabs distinct orders, order number and date, which we've been doing for a while now. Next is weekly orders.
So now I'm getting the number of orders by week, which you saw before. (I think what's on screen is an outdated version; I should have replaced this week expression with the Monday-snapping one, and I don't know why I didn't update it here, but I did re-run the numbers after updating it, so the results should be fine.) So this gives you the number of weekly orders. Again, these are distinct logical units.

The next CTE constructs the feature for all the orders that happened in the last seven days. What we do here is take the weekly orders (I alias it, since that's easier to type) and extract just the week, which is the date on which we want to make the prediction. To build the feature, I join that with the distinct orders, but only considering orders that happened from seven days before that date up until that date; hence this join condition. What the SQL does is take the main date and subtract seven days from it (this is SQLite syntax, which I'm not a fan of, but that's how it's written) and keep the orders in between. So the output of this table is a bunch of weeks, each with the corresponding number of orders that happened within the last seven days, which is exactly our first feature, orders last 7 days.

Then we do the copy-paste thing right below, and get the orders that happened in the last 30 days just by changing that one number. That constructs our second feature.

Then we have the label. The label is the same idea, but in the forward direction: for every single week, get all the orders that happen from today until seven days from now, and that's what this join is about. So again we join weekly orders with distinct orders; I called this CTE label orders, because it technically is the label, just for the sake of naming clarity. And that's how we construct our label.

Now all we need to do is join our features with our labels, which we do right here. We're doing LEFT JOINs, because there may be situations where we don't have any orders in the last seven days, or in the last 30 days, which clearly is the case given how much sparsity there is in our data. And I'm using COALESCE here, which basically says: if this value is null, replace it with zero. That's the right call, because if the value doesn't exist, it means we simply had no orders in that window, and I do that for all of these columns. So we end up with the week (the day we're making the prediction), the two features (the 7-day and 30-day backward-looking order counts), and the label (the number of orders we'll see in the coming week, the future, which is what we want our model to predict).
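Putting all of those pieces together, base.sql comes out looking roughly like this. This is my reconstruction from the walkthrough above, not the file verbatim; the CTE names and exact window boundaries are assumptions:

```python
base_query = """
WITH distinct_orders AS (
    SELECT DISTINCT order_number, DATE(date) AS date
    FROM orders
),
weekly_orders AS (
    SELECT DATE(date, '+1 day', 'weekday 1', '-7 days') AS week,
           COUNT(DISTINCT order_number) AS num_orders
    FROM distinct_orders
    GROUP BY 1
),
orders_last_7 AS (
    SELECT w.week, COUNT(d.order_number) AS orders_last_7_days
    FROM weekly_orders w
    LEFT JOIN distinct_orders d
        ON d.date >= DATE(w.week, '-7 days') AND d.date < w.week
    GROUP BY 1
),
orders_last_30 AS (
    SELECT w.week, COUNT(d.order_number) AS orders_last_30_days
    FROM weekly_orders w
    LEFT JOIN distinct_orders d
        ON d.date >= DATE(w.week, '-30 days') AND d.date < w.week
    GROUP BY 1
),
label_orders AS (
    -- same join, but looking forward: the week we want to predict
    SELECT w.week, COUNT(d.order_number) AS label
    FROM weekly_orders w
    LEFT JOIN distinct_orders d
        ON d.date >= w.week AND d.date < DATE(w.week, '+7 days')
    GROUP BY 1
)
SELECT w.week,
       COALESCE(f7.orders_last_7_days, 0)   AS orders_last_7_days,
       COALESCE(f30.orders_last_30_days, 0) AS orders_last_30_days,
       COALESCE(l.label, 0)                 AS label
FROM weekly_orders w
LEFT JOIN orders_last_7  f7  ON f7.week  = w.week
LEFT JOIN orders_last_30 f30 ON f30.week = w.week
LEFT JOIN label_orders   l   ON l.week   = w.week
WHERE w.week >= '2016-01-01'   -- drop the very sparse early data
ORDER BY w.week
"""
base = pysqldf(base_query)
```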
And just for the sake of cleaning my data up, my training data only goes back to January 2016 (that's the date filter at the bottom); nothing before that, because when I looked at the data, it was very, very sparse earlier. And so we construct this data frame. I hope all of that was clear, because that was the main chunk of constructing this entire frame altogether. Okay.

Now I'm saying: the features are these two columns, and I'm assigning the label as this one. Then we split our data into train and test sets by time period. I am not going to shuffle this data, because I want my test set to come entirely after my train set. If you were to shuffle, it would lead to data leakage: you'd be training on data that is later in time than what's in your test set, and even though you'd see great performance offline, your model would be really bad in the real world. So you don't shuffle your data, because it's time-sensitive. Then we just split into X train, y train, X test, y test.

Now we pull together our XGBoost regressor. When we create it, we pass in 500 estimators, the base trees used to make the prediction. Typically, the more there are, the more complex the model, and the more it can correct the errors that previous trees have made, so higher is generally good; but you can fine-tune this number yourself by watching how well and how fast the model picks up on the nature of your data. We also specify a learning rate. Then, when fitting the model, I fit on the train set, but for evaluation I display metrics for both the train data and the test data, just to see what's up. The metric we're evaluating is MAE, mean absolute error.

One thing to really clarify here: training in XGBoost, by default, uses the squared error loss. It takes the difference between the prediction and the label and squares it; that's the objective it's minimizing. Whereas what I'm displaying is mean absolute error, which is just the absolute difference between the prediction and the label. I'm displaying MAE because, to me, it's just a lot easier to read. When you look at this output, validation_0 is the train MAE, validation_1 is the test MAE, and you can see it going down, so there definitely is some predictive power; and this is the final test MAE we end up with. You'll notice that towards the end, the MAE creeps slightly back up. Again, that's because XGBoost optimizes the squared loss while we're monitoring a different one. And my hunch is that we can't optimize mean absolute error directly in XGBoost, because it's a form of gradient boosting that needs second-order derivatives, and it requires a non-zero second derivative of the loss, which the absolute error doesn't have.
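Either way, here's a sketch of that whole modeling section, picking up from the base frame built above. The test fraction, the learning rate value, and the plot styling are my assumptions, since the video doesn't pin them down:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

features = ["orders_last_7_days", "orders_last_30_days"]
X, y = base[features], base["label"]

# shuffle=False keeps the test set strictly after the train set in time,
# so we never train on the future (no leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

# 500 base trees; MAE is only the *reported* metric, while the training
# objective remains the default squared error
model = XGBRegressor(n_estimators=500, learning_rate=0.01, eval_metric="mae")
model.fit(X_train, y_train,
          eval_set=[(X_train, y_train), (X_test, y_test)],
          verbose=100)

# Collect test-period predictions next to the actuals
test = X_test.copy()
test["week"] = base.loc[X_test.index, "week"]
test["label"] = y_test
test["predictions"] = model.predict(X_test)

# Last 50 weeks of actuals, with the test-period predictions overlaid
ax = sns.lineplot(data=weekly_orders.tail(50), x="week", y="num_orders")
sns.lineplot(data=test, x="week", y="predictions", color="orange", ax=ax)
plt.show()
```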
I'll have to confirm that second-derivative point, though; that's just the intuition. In any case, I'm displaying MAE because it's more visually intuitive here: when I see 30, it means that on average we're about 30 orders off in our predictions. Honestly, that's kind of high, but it's okay, we can roll with it; it's the data that we have. It really just goes to show that your predictions are only as good as your data. If you have good data, you'll get cleaner predictions.

Anyway, now we can make predictions with the test data, and I'm forming the results into a data frame: creating a pandas Series, then building a data frame out of it right here. Then I plot this data. I'm not sure I even explained this, but matplotlib has rcParams, and I've set the default figure size to 17 by 3 and the DPI to 300, which I think is required because you want that high resolution; otherwise, matplotlib defaults to something like 100 DPI.

So now we're plotting out the predictions. weekly_orders.tail(50) gives exactly the same graph we saw before (I think I explained that too), and then we have the actual test data frame, which is this orange line corresponding to the predictions. You'll see that we're pretty good at making these predictions, especially over here, except that when we come down here, we're consistently under-predicting. Part of the job as a data scientist is to understand why that would be happening. So if you actually look at the data, plotted out as a data frame, this entire test set: the orange line corresponds to the predictions column you see here, and the blue line is the label column. What you'll see is that we do keep making predictions after August 26th, 2019, but the previous weeks had almost no sales, and because of that we predict consistently low values, much lower than we'd anticipate. Which kind of makes sense: you can't expect the model to be great right after a period of closures; you can't expect it to just come back up, because that's the nature of the data.

I'm assuming, though, that with actual, realistic order information, you'd have consistent daily volume coming in, and potentially even more predictors with which to inform your model, so you could use other regressors. So ideally, in real life, you'd be able to build an even more robust model than what we have here. But I hope this entire project gave you an intuition for what we could be expecting.

And yeah, I'll put all of this code up on GitHub. I also have a tidbit using Facebook's Prophet, Facebook's take on approaching time series analysis; I do get something out of it, but that's for another time, since I want to iron out those numbers a little better. Regardless, the code will be in the description down below. So if you liked the video, give it a like, give it a thumbs up, share the video, comment, do everything you can; I'm trying to grow a community here. And I'll see you in the next one. Take care.