Hey everyone. Today we're going to look at a really important concept called quantile regression, and I think the best way to approach it is through an application: how a company like Instacart would use it. A quintessential machine learning problem at Instacart is: given the distance from a grocery store to a buyer's location, how long will it take a driver to deliver the groceries? In other words, given distance, predict the ETA, and that's what the problem here symbolizes. Now, a plain regression model would project a single number: for this buyer, a 10-mile trip will take, say, 30 minutes. You can get that with a simple regression technique. But on the app you typically don't see "your groceries will be here in 30 minutes"; you see "between 25 and 35 minutes" (converted to a time), or "your groceries will be here within 35 minutes." There's always a lower and an upper bound, and to compute those bounds we use quantile regression, which is why it becomes super important here. So that's the basic intuition for where we'd use quantile regression. Let's get a little into the weeds with the math, not too much, and then we have some code right under here.
So let's start with that. Like I said, for a typical regression problem we have a squared-error loss, where y is the actual label (the true delivery time in this case) and xθ is the prediction from our model, which would be the ETA here. For quantile loss the formulation changes; it looks pretty complicated, but it's actually pretty simple, and I'll explain it. The idea is: we want to penalize the loss if the target percentile is low but the prediction is high, and also penalize it if the target percentile is high but the prediction is low. What we mean is that with quantile loss we're predicting a percentile, a bound within which we're confident the order will be delivered: the lower bound, the upper bound, or some bound in between. A good example is the Instacart ETA case: 30 minutes would be the label in a plain regression setup, but we gave a 25-to-35 bound. Say there's a 10% chance the delivery arrives within 25 minutes and a 90% chance it arrives within 35 minutes. Our model makes exactly those projections: its 10th-percentile output is 25 minutes and its 90th-percentile output is 35 minutes, which is why the Instacart app can say your groceries will be here between 25 and 35 minutes. Each of those two numbers, 25 and 35, is the output of a regression model trained with a different quantile, and that quantile is what τ is here. τ can be any number between 0% and 100%, or 0 to 1, actually. Now, here's the first line.
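The formula the narration is describing (the "first line" and "second line" it refers to) isn't shown in the transcript, so here is a reconstruction of the standard pinball loss from the description, with ŷ = xθ as the model's prediction:

```latex
L_\tau(y, x\theta) =
\begin{cases}
\tau \, (y - x\theta) & \text{if } y - x\theta \ge 0 \quad \text{(prediction below the label)} \\
(\tau - 1) \, (y - x\theta) & \text{if } y - x\theta < 0 \quad \text{(prediction above the label)}
\end{cases}
```

In the second case y − xθ is negative, so multiplying by (τ − 1), which is also negative, keeps the loss positive, exactly as the explanation below spells out.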
Let's take a look at this first line. If y − xθ ≥ 0, the predicted value from our model is below the label. That's good if we're predicting a low quantile, say the 10th percentile (τ = 0.1), because we expect the prediction to be low. But if τ is high, like the 90th percentile, and the prediction comes out much lower than the label, we want to penalize that. The exact opposite is true for the second line: if y − xθ < 0, the prediction is above the label, which is only good for higher quantiles, and we want to penalize it for lower quantiles. We write (τ − 1) instead of (1 − τ) because y − xθ is negative there, and we want the product to be a positive number. And that's it, actually; that's all of quantile loss. If you understood this, you understood the math, so we can jump into the actual problem and implementation.

The problem, like I said before: build a regression model that determines delivery time based on the distance from the buyer. Here are the libraries we're importing. First is make_regression, which creates our regression dataset; then pandas, our go-to library for data manipulation; matplotlib and seaborn for the pretty charts; numpy for the math; and train_test_split to split the data into train and test sets. Finally, LightGBM is basically Microsoft's implementation of gradient-boosted decision trees, and it's actually pretty good and very easy to use for quantile regression.
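The two cases just described can be sketched as a small numpy function. This is my own vectorized sketch of the pinball loss, not Instacart's or LightGBM's internal implementation:

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    # diff >= 0: prediction is below the label -> weight by tau
    # diff <  0: prediction is above the label -> weight by (tau - 1),
    # which is negative, so the product stays positive
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, tau * diff, (tau - 1) * diff))

# Predicting 25 when the label is 30 (an under-prediction) is cheap
# for a 10th-percentile model but expensive for a 90th-percentile one:
low  = quantile_loss(np.array([30.0]), np.array([25.0]), tau=0.1)  # 0.5
high = quantile_loss(np.array([30.0]), np.array([25.0]), tau=0.9)  # 4.5
```

Minimizing this loss over a dataset pushes the prediction toward the τ-th quantile of y given x, which is exactly why training with different τ values yields the different bounds.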
And so I use it here. Now, this cell is for making the dataset. I didn't get this data from anywhere; I'm generating it on my own and tweaking it. We have 10,000 examples with one feature, the distance, and that feature is informative. I set random_state to 42 just to fix the seed. Right here I'm converting everything to DataFrames, adding some noise, and shifting the mean and standard deviation so that the numbers become actually representative of a distance in miles and a time-to-buyer in minutes. If you look at the distributions of the feature and the label, they do look pretty legitimate: a legitimate feature and a legitimate label. Then I split into train and test sets with a 90-10 split, and that's it, that's all about getting our data.

So we have our data on hand and now we can play around with it. Let's first visualize some of the test data. The test set is 10% of the data, so that's about 1,000 samples. Plotting distance versus time-to-buyer, you get this pretty little chart, and you can see there's a linear relationship between distance and time to buyer, so it shouldn't be too difficult to model this data. Okay, so here's the chunk with the actual training process.
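A sketch of that dataset cell. The shift-and-scale constants below are my own illustrative choices to make the numbers read like miles and minutes; the video's exact constants aren't given:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# 10,000 examples, one informative feature, fixed seed
X, y = make_regression(n_samples=10_000, n_features=1, n_informative=1,
                       noise=10.0, random_state=42)

# Shift/scale so the feature reads like miles and the label like
# minutes (constants are illustrative assumptions)
df = pd.DataFrame({
    "distance":      10 + 3 * X[:, 0],
    "time_to_buyer": 45 + 0.3 * y,
})

# 90-10 train/test split
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
```

With test_size=0.1 you end up with the roughly 1,000 test samples plotted in the next step.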
I define tau, our set of quantiles, as 10%, 50%, and 90%, and I have to iteratively train a different LightGBM model for each one, because for these tree-based boosting regressors we can't train one model that outputs all the quantiles; we have to train one regressor per specific quantile. That isn't really a big deal, because training these regressors is not that slow. I'll give you a tidbit here, actually: if you wanted to try this with a neural network, the network could take the same input but have three output neurons, one each for the 10% quantile, the 50% quantile, and the 90% quantile (and you can add any other quantile you want). So the output would be three regressed values, and you'd just change the loss function to reflect this quantile regression loss, and you get a neural-net quantile regressor. Pretty fun stuff. I didn't do it here, but it's a cool idea you could take on as a little personal project. So, okay, we fit this quantile regressor.
When fitting, I'm saying: hey, I want to minimize the quantile loss, and the quantile here is 10%, or whichever one of the three we're on. Then we fit it, make predictions on the test set, and I append them to one big dictionary that I keep adding to. From that I construct a DataFrame with all the predicted values and the actual time-to-buyer label. So in the end we have a DataFrame with the feature, the three predictions made by the models, and the actual label, the time it actually took to reach the buyer. Typically this value lies between the 10% and 90% predictions. There are some cases, like here, where it's slightly above the 90% prediction or even slightly below the 10%, but in most cases a buyer would see something like "your time to arrival is between 53 and 70 minutes," which seems pretty legitimate here because it actually arrived in 54 minutes, so that's good.

Next I'm doing something called melting this DataFrame: basically you take all these columns and stack them into a single column of values, with each cell's corresponding value reflected in a value column. I do this because in the next part I want to plot the data out. So here, let me just go here: you see these blue ticks, right? These are the actual labels.
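The melt step can be sketched like this. The column names and the two example rows below are my own stand-ins for the results DataFrame described above:

```python
import pandas as pd

# Hypothetical results frame: one feature column, three quantile
# predictions, and the actual label (names are assumptions)
results = pd.DataFrame({
    "distance":      [3.2, 7.5],
    "pred_q10":      [18.0, 53.0],
    "pred_q50":      [24.0, 61.0],
    "pred_q90":      [31.0, 70.0],
    "time_to_buyer": [22.5, 54.0],
})

# Melting stacks the value columns into (series, value) pairs,
# which is the long format seaborn wants for plotting the ticks
melted = results.melt(id_vars="distance",
                      var_name="series", value_name="value")
```

Each original row becomes four rows in `melted`, one per series, so a single scatter call can color the labels and the three quantile predictions differently.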
And here you can see the same blue ticks, the same labels, but for each of these thousand blue ticks we also have a thousand orange ones, a thousand green ones, and a thousand red ones, which signify the 10th-percentile, 50th-percentile, and 90th-percentile predictions respectively. So when we're making a prediction for one of these points and we want a lower and upper bound, we have it just by looking at the orange and the red ticks. And so for this Instacart example, you can get lower and upper bounds, which is pretty cool. Now, looking at this picture you could say, "yeah, it does kind of look like a lower and upper bound, ish," but you can verify that from the nature of the test data itself. For the 1,000 test examples: in how many cases is the label greater than the model's 10th-percentile prediction? It is indeed 90 percent of them, so only 10 percent are actually less than the value predicted by that model. How many are greater than the 50th-percentile prediction? Almost half, 50 percent. And how many are greater than the 90th-percentile prediction? It's actually exactly only 10 percent. So clearly quantile regression is working the way you'd intuitively expect, and you can do this for almost any other application, too. I have a couple of resources here on Instacart, and there's also a pretty cool quantile regression blog right over here that you can reference. And that's about it. I hope you understood everything in this video. I'm probably going to put this code up on GitHub. Please like, share, subscribe, do all that good stuff, go the whole nine yards, and I will see you in the next video. Bye-bye!
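That coverage check, the fraction of labels falling above each quantile prediction, can be sketched like this. The labels and predictions here are synthetic stand-ins (normal noise and its exact quantiles) just to illustrate the check; a real model's test-set outputs would go in their place:

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.normal(60, 5, size=1000)        # stand-in labels (minutes)

# Stand-in quantile predictions: the true 10th/90th quantiles of the
# label distribution (z ~= +/-1.2816 for the 10%/90% normal quantiles)
pred_q10 = np.full(1000, 60 - 5 * 1.2816)
pred_q90 = np.full(1000, 60 + 5 * 1.2816)

# Fraction of labels above each bound: ~90% should clear the
# 10th-percentile prediction, ~10% the 90th-percentile one
coverage_10 = np.mean(y_test > pred_q10)
coverage_90 = np.mean(y_test > pred_q90)
```

If a trained quantile model's coverage numbers land far from τ, that's a sign the model is miscalibrated for that quantile.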