A couple of warnings — I don't know whether it's a warning or not. This talk is a bit more technical. I presented the same topic in my department seminar and I didn't change anything, because I'm pretty lazy. So you might find this a bit technical. As I said, it might not be a warning for you — if I were you, I'd actually like things to be more technical. OK, let's start. My background: I'm doing a PhD in AI at the National University of Singapore right now. Before that, I worked at Flipkart for three and a half years, and before that I was studying at another university in India. This work was done jointly with some of my colleagues in the data science team at Flipkart.

OK, let's start with the motivation. The title is very cryptic, I know, so we'll slowly work out what it means. Why do we really need to do forecasting? Well, I'm not going to talk much about the business reasons. Forecasting is very important for any commerce company. It's not some fancy cool stuff they do — it's bread and butter for them. If they don't do it correctly, they're in serious trouble. You don't build these things just for fun; it's very important for the business. So that's obviously the most important motivation.

But why am I here? Why should you really listen to me? By the way, this talk is also about deep learning — we use deep learning to do the forecasting. So a key takeaway from this talk is how you can use deep learning in forecasting. If you're facing a similar forecasting problem, or doing some kind of sequential data prediction, this talk might come in handy. The third thing is that deep learning normally doesn't get used on structured data. Deep learning really shines when the data is unstructured, because the other methods don't work well there; on structured data, other methods work quite well. So you could ask why we're using deep learning when this is a structured data problem. Again, I don't have a theory to back it up — coming from academia, I really shouldn't talk this way — but as we've seen, the results are much better. We tried hard with other methods too, and they work reasonably well, but with deep learning we got more success. So that could be another motivation.

Let's start with the problem statement. The primary objective is to forecast the daily sales of all the products that Flipkart sells, which is roughly on the order of 1 million. The dataset we have is, for each of these products, the entire history: how much it sold, what the price was on a particular day, what offers we ran, what discounts we had, and many other things — whether it was displayed on the home page, whether it got a big banner or a small listing on the home page. All of these things are captured. Given the entire history, we have to predict the sales of each of these products 30 days ahead. Normally it's a daily forecast, but sometimes, depending on the use case, we do weekly as well: we aggregate the forecast at a weekly level and then forecast. And all these problems are interrelated, so if you solve one problem well, you make good progress on the others.
As you can see, price elasticity means you have to predict sales as a function of price — so that is nothing but forecasting as well. And the goal-seek problem, which is the inverse of price elasticity: you have a sales target, and you want to know what price you should charge the customer. The fourth one is predicting the sales of a new product. A new product is something we haven't seen before — completely new, like the iPhone that's launching in a couple of days — and we want to predict its sales. That's a very hard problem: since you don't have any history, there's a very high chance of making large errors. We're not going to touch the fourth one much, but I'll give you some pointers on how you might go about solving it too.

So what affects demand? This is a high-level overview of the factors. There are many others you could probably come up with from your own intuition, but I think we've reasonably captured most of the things that affect demand. By demand, I mean sales — actually, demand is not exactly sales. If some of you come from an MBA background, you'll recognize this: demand is the maximum amount of sales that could happen, and sales is what actually happened. I'd highlight two columns here: price and visibility. These are the most important drivers of sales, or demand. Of course, price is also combined with discounts — actual price, discounted price, and so on — so we have many different types of price. In fact, we vary the price over different points of the day, so you can compute statistics of price: median price, maximum price over a day, minimum price over a day, and so on. When we started, we did some feature selection and engineering, which is the traditional way of tackling this problem: we have many, many features, we intelligently come up with different engineered features, and we try to see whether they work or not. And that works reasonably well, as I said.

And why is forecasting tough? Just look at this graph — sorry, you probably can't see it. On the x-axis you have the date, and on the y-axis you have the sales. You can see the variability in the data: it's extremely noisy. You can't come up with anything that would predict what's going to happen next if you just use this data.

I'll summarize what I just mentioned. We first started with a time series model, the so-called ARIMA model: auto-regressive integrated moving average. This is the equation; basically it says that today's sales depend on the last five periods of sales. Time series forecasting is a pretty rich area of statistical learning, with a lot of literature you can go through, but I don't have much time, so I'll move through it quickly. Then we moved to Cubist, which is another model. What happened is, we cannot do a good job using just one time series and then predicting — we could see that was a hopeless case. We really have to understand what caused a spike or a trough in the sales. For that, you need some kind of causal model where you have these features.
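To make the ARIMA baseline concrete, here is a minimal sketch, assuming statsmodels and a pandas series of daily sales for one product. The AR order of 5 mirrors "today's sales depend on the last five periods"; the differencing and moving-average orders are illustrative choices, not the tuned values from this work.

```python
# A minimal sketch of the ARIMA baseline, not the exact Flipkart setup.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Stand-in for real data: 200 days of noisy daily sales.
sales = pd.Series(
    rng.poisson(lam=20, size=200).astype(float),
    index=pd.date_range("2017-01-01", periods=200, freq="D"),
)

# order=(5, 1, 1): AR on the last five periods; d and q are illustrative.
model = ARIMA(sales, order=(5, 1, 1)).fit()
forecast = model.forecast(steps=30)  # 30-day-ahead point forecast
print(forecast.head())
```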
And basically, you train a model using those features as your independent variables and sales as your dependent variable. We tried many methods — XGBoost, all the popular methods in the Kaggle community. You can run all these methods in a for loop, see which one does best, and pick that one. It's very easy — I would say very mind-numbing work. All of these work reasonably well. Cubist — don't go by the name, it's a very simple model, a variant of a decision-tree-based model. The paper was published in 1993, I think, so it's quite old in that sense. You can think of Cubist as a random forest: just replace Cubist with random forest and the same things hold.

Now we come to the main topic of our discussion, which is the deep learning model. I'll be fairly quick here, but I'll spend more time on the subsequent slides. If you have questions, you can go through our paper, which is on arXiv — you can download it and read through it if you want. Or, since I'm around for most of the time, you can catch me and ask questions.

The main contribution of this paper is that we propose an RNN architecture. How many of you know about RNNs? By RNN, I mean recurrent neural network. Quite a few of you. Who doesn't know anything about RNNs? OK, I see — I'll give you a slight bit of an idea of what an RNN is. How many of you know the basics of deep learning? And how many of you know some basic concepts of probability, like conditional distributions? Quite a few of you. Likelihood maximization? OK, cool. So some of you will be able to get more out of this talk. Those who don't understand everything can go through these basics first. You're very honest.

OK, so our model doesn't give you a point forecast — it gives a probabilistic forecast. By that I mean: normally a model gives the expectation of a distribution, which is a single number. Here we are giving a spread: we say that the prediction lies between y1 and y2 with some probability. How does it make a probabilistic forecast? In our model, we estimate the distribution of sales — we actually find the density function of sales. And when you have the density function, you can draw Monte Carlo samples from that distribution. You can use MCMC (Markov chain Monte Carlo), variational inference, and lots of other methods if you want to sample from a high-dimensional distribution. Once you have those samples, you can easily calculate the mean, the 90% quantile, the 10% quantile, everything, as in the sketch below.

And this model is trained on a cohort of products, not just on a single product. Why? Because some products don't have much history — a product that has just launched, for instance. (By the way, we are not covering the case where a product has no history at all; we only cover the case where it has some history, say at least 10 days.) If a product was launched fairly recently, you have very little history, and if you just do a time series prediction, it won't work well.
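Here is a minimal sketch of that last step — turning density samples into forecast summaries. The `samples` array is a hypothetical stand-in for Monte Carlo draws from the estimated sales density, however they were obtained; the two-component mixture mirrors the multimodal sales behaviour discussed later in the talk.

```python
# A minimal sketch: summary statistics from Monte Carlo samples.
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical draws from the estimated sales density for one product/day.
samples = np.concatenate([
    rng.normal(loc=2.0, scale=1.0, size=5000),   # low-sales regime
    rng.normal(loc=40.0, scale=5.0, size=5000),  # promotion-spike regime
])

mean = samples.mean()
q10, q90 = np.quantile(samples, [0.10, 0.90])
print(f"mean={mean:.1f}, 80% interval=[{q10:.1f}, {q90:.1f}]")
```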
Of course, I just told you that time series prediction doesn't work well anyway — but even if you do a time series regression, it doesn't work well with just 10 days of data. Because of that, we need to leverage the history of other products which are similar and which have sufficient history, and use that somehow to make the prediction. Our model does that.

Next, our approach can generalize to non-Gaussian noise. Here we assume that the distribution is Gaussian — in fact, a mixture of Gaussians — but you could instead say that your data comes from, say, a Poisson distribution. That makes sense, because sales are always non-negative, and Poisson has support only on the non-negative integers: it puts probability mass on x ≥ 0 and none on x < 0. So you can say your data comes from a Poisson or a negative binomial; if you do that, you can adapt the likelihood accordingly and everything works in a similar way.

Less manual feature engineering work — the stress is on "less". You might expect deep learning to do all the heavy lifting, all the feature engineering for your use case, but it doesn't really work that way. It sounds great that deep learning does automatic feature engineering, but in real life, in the wild, it doesn't. We had to do some amount of feature engineering — not much, but some.

OK, now I'm going to talk a little bit about how a recurrent neural network works, for those of you who haven't heard of them. I hope you understand a little bit of linear algebra as well; otherwise the subsequent notation is hard to follow. Here, this x is our price. Say we have just one feature in our model, which is price; then x is not a vector anymore, just a scalar. So this is the price on the 0th day, this is the price on the first day, and so on. And h_0 is called the state vector — the internal state vector. Each of these arrows means we're doing some operation, normally a linear operation: a dot product with a weight matrix. So we take x_t, take a dot product with the weights, and then we have h_{t-1}, the state vector from the previous step. You combine them together and you get h_t, the current state vector — something like h_t = f(W_x x_t + W_h h_{t-1} + b_h). If you expand this recursive formula, you'll see that h_t depends on h_{t-1}, h_{t-1} depends on h_{t-2} and x_{t-1}, and so on — it actually depends on your entire history. And z_t — by z I mean the sale — is just a function of h_t. The colors aren't mine, by the way; this has been taken from a blog by Chris Olah, which some of you may have come across — it's very popular. The green color means that these weights are shared: W_h, W_x, b_h stay the same over time. That way you don't have a lot of parameters — you don't have to learn a parameter for each time step — so you can scale to very long sequences. This is just a vanilla RNN; it's not the RNN people use nowadays. People use LSTMs, which I'm not going to go into — much more complicated, but the idea is similar.
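As a concrete illustration of that recurrence, here is a minimal sketch of a vanilla RNN step in plain NumPy, assuming the one-feature (price) setup from the slide; all sizes are illustrative.

```python
# A minimal sketch of the vanilla RNN recurrence h_t = f(W_x x_t + W_h h_{t-1} + b_h).
import numpy as np

rng = np.random.default_rng(2)
d_x, d_h = 1, 8                       # one feature (price), 8-dim state
W_x = rng.normal(size=(d_h, d_x))     # shared input weights
W_h = rng.normal(size=(d_h, d_h))     # shared recurrent weights
b_h = np.zeros(d_h)                   # shared bias

prices = rng.uniform(10, 20, size=30).reshape(-1, 1)  # x_0 ... x_29
h = np.zeros(d_h)                     # h_0, the initial state vector
for x_t in prices:
    # The same W_x, W_h, b_h are reused at every step, so the
    # parameter count does not grow with the sequence length.
    h = np.tanh(W_x @ x_t + W_h @ h + b_h)
print(h[:4])  # the final state summarizes the entire price history
```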
Okay, now let's come back to our model. Let me explain the notation first. As I said, z is our sale; i indexes the i-th product — say we have 10,000 products — and t_0 is the current time, today. Capital T is t_0 + 30, if you're predicting for the next 30 days. We're trying to model this density function, where z_{i, t_0:T} means all the sales from t_0 to T — from today to 30 days later — concatenated into one vector. Similarly, following the same notation, z_{i, 1:t_0-1} means the sales from the first day up to yesterday. And for x, you can see we have 1 to capital T. That means we're assuming we'll have all the feature data in the future as well, which is slightly unrealistic, I know — if you ask the business what the price is going to be after 10 days, at least at Flipkart we found it hard to convince them to give us that data. But this is an assumption we have to make: we cannot predict without knowing what the price is going to be. If you suddenly decrease the price, of course the sales are going to be higher; if you don't tell me the price, it's impossible to do the forecast. That's why x runs from 1 to capital T, into the future. And note that I've written x as a vector, so it may have price and other elements as well.

Okay, now I'm writing the likelihood function. The likelihood is just this joint density — strictly we'd take the log to get the log likelihood, but this here is the likelihood itself. We write it as a product over t from t_0 to capital T, roughly p(z_{i,t_0:T} | z_{i,1:t_0-1}, x_{i,1:T}) = ∏ p(z_{i,t} | h_{i,t}; θ). The fact that we can take a product is an assumption: the terms are conditionally independent given the state. And this whole thing — all the information from the past, plus the x's of the future — has been summarized into h_{i,t}; it's just a new notation for that. This h_{i,t} is modeled as an LSTM: as you can see, it has exactly the LSTM form — h_{i,t} depends on h_{i,t-1}, z_{i,t-1}, and x_{i,t}.

Okay, now, given this model, we optimize it using stochastic gradient descent or whatever, and then we know the value of theta — it's currently unknown, but once we've optimized the model, we know theta, and then we know the density. Once we know the density, we can get Monte Carlo samples using MCMC or other techniques, and then we can do all those things: the mean of the prediction, the standard deviation of the prediction, any quantile, and so on. I'll give you an example to make what I just said concrete (there's also a code sketch of the recurrence right below). Suppose you are using a Gaussian likelihood.
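Before the Gaussian example: a minimal sketch of the recurrence just described, assuming the form h_{i,t} = LSTM(h_{i,t-1}, z_{i,t-1}, x_{i,t}) — the previous sale fed back in alongside the current features. At prediction time the future z's would come from the model's own samples; here everything is random stand-in data.

```python
# A minimal sketch of the state recurrence, sizes illustrative.
import torch
import torch.nn as nn

d_x, d_h = 3, 32                      # feature dim and state dim
cell = nn.LSTMCell(input_size=d_x + 1, hidden_size=d_h)

T = 40
x = torch.randn(T, d_x)               # features for t = 1..T (price, offer, ...)
z = torch.randn(T)                    # sales; future values would be sampled
h = torch.zeros(1, d_h)               # hidden state
c = torch.zeros(1, d_h)               # LSTM cell state

for t in range(1, T):
    inp = torch.cat([x[t], z[t - 1:t]]).unsqueeze(0)  # [x_t, z_{t-1}]
    h, c = cell(inp, (h, c))
# h now summarizes the whole history; the density of z_t is read off h.
```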
In a Gaussian likelihood, you have two parameters: mu and sigma — this is the density function of a Gaussian. So in this case, our neural network gives these two as output; they come from the LSTM state h_{i,t}. And notice that for sigma — this is actually sigma squared, by the way — the value has to be positive, so we apply a transformation called softplus: sigma² = log(1 + exp(·)). This makes sure sigma² is greater than zero, because 1 + exp(·) is always greater than one, and the log of anything greater than one is greater than zero.

OK, we first tried this model, but it has a problem. The problem is that Flipkart's data is very sparse — it follows the 80–20 rule: 20% of the products drive 80% of the sales in a category. Because of that, most products don't actually sell much, so you have a lot of zeros in the sales for most products. What I'm trying to motivate is: your data is multimodal — you don't have a single mode. But if you model it with a single Gaussian, you get a single mode. So when we first tried the single-Gaussian model, it didn't work: a product sometimes sells a lot on some days and not much on others, and all of that gets clubbed together and concentrated around one mode, giving a very averaged-out prediction. It didn't capture the dynamics of sales versus price and the other factors. So we had to assume the data comes from a multimodal distribution, and the simplest way to do that is a Gaussian mixture likelihood. A mixture likelihood typically has multiple modes — these modes are mu_1, mu_2, and so on. But in this case we have more parameters: with two Gaussians, you have two means and two variances, so four of them, plus p_1 and p_2 — though p_1 and p_2 are really one parameter, because p_1 + p_2 = 1. So with two Gaussians you have five parameters. Of course, you increase the number of parameters, and the next thing you need to take care of is that the p's are probabilities, so you apply a softmax to make them sum to one. This picture shows how the mixture looks — two Gaussians here, actually: one Gaussian, another Gaussian, so two modes, one here and one here.

OK, this is the final architecture — I'll quickly tell you what it is. Right now we have three features: price at time t, offer at time t, and this p, which is the product embedding. I'll come to that. The problem we face should be straightforward to understand: you are training the model on all 10,000 products in a category, and each product has its own sales characteristics, so if you don't somehow tell the model which product it is, it doesn't work. So how can you take product ID as a feature?
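Here is a minimal sketch of that mixture output head in PyTorch, in the spirit of the model described: the LSTM state goes in, and the mixture weights, means, and variances come out, with softplus keeping variances positive and softmax making the weights sum to one. The class name `MixtureHead` and all sizes are illustrative, not from the paper.

```python
# A minimal sketch of a Gaussian-mixture output head on the LSTM state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureHead(nn.Module):
    def __init__(self, d_h: int, n_components: int = 2):
        super().__init__()
        self.mu = nn.Linear(d_h, n_components)       # component means
        self.raw_var = nn.Linear(d_h, n_components)  # pre-softplus variances
        self.logits = nn.Linear(d_h, n_components)   # pre-softmax weights

    def forward(self, h: torch.Tensor):
        mu = self.mu(h)
        var = F.softplus(self.raw_var(h))            # log(1 + e^x) > 0
        pi = F.softmax(self.logits(h), dim=-1)       # weights sum to 1
        return pi, mu, var

head = MixtureHead(d_h=32)
pi, mu, var = head(torch.randn(5, 32))               # batch of 5 states
print(pi.sum(dim=-1))                                 # each row sums to 1
```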
It becomes a problem, because if you have 10,000 products and you one-hot encode them, you get a 10,000 × 1 vector — and if you're in the books category, you get a one million × 1 vector, which is problematic. That's why we had to do an embedding, very similar to how word2vec works. You take the product ID, and the embedding dimension is much lower than 10,000 — if there are 10,000 products, the embedding is something like 100-dimensional. It gives you a much denser representation: the one-hot vector is very sparse, this is much denser.

So, as you can see here, this is your feature vector. The dense embedding goes in, along with some categorical features, and all these others you can think of as continuous features. There are many other embeddings too, which I didn't put here because it would be too long — it wouldn't look good. And this becomes your x_t. x_{t-1} also has all these things, which I haven't drawn: instead of t you'd have t−1 everywhere, and likewise for x_{t+1}. It goes in here, then here, and from this h_t you predict all the parameters — I'm assuming we're talking about a two-component Gaussian mixture, so these are the parameters.

OK, how we train this model is very straightforward. As I said, some products might not have much history. If you just train a model on one single product, you have very few data points — say a product has 10 days of history, then you have just 10 days of data, which is impossible to train anything on, in fact. And you also won't be able to learn from other products. That's why we train on a homogeneous cohort of products. Of course, you don't want to mix men's clothing with, say, mobile phones — that would be a ridiculous thing to do, right? So you take a homogeneous set of products and train the model on that. Actually Flipkart — and any e-commerce company, Amazon too, in fact — has a hierarchy of products: you start from the lowest level, which is product ID, and you go up — after that I think it's vertical, then category, then super-category, and then the entire Flipkart. So you can build a model at each level of the hierarchy, and after training that model you can fine-tune on each of the levels below. You understand what fine-tuning means, right? By fine-tuning I mean you reach a set of parameters, and then, with those parameters as the starting point, you run your stochastic gradient descent again. That is also possible — we do that too.

The model training itself is fairly straightforward: you have the likelihood, which is our Gaussian mixture; since you want to maximize the likelihood, you take the negative log likelihood and minimize that. That is your loss function (a sketch follows below). It's just like classification: in classification we assume the data comes from a Bernoulli distribution and take the log likelihood — the machine learning community gives it a fancy name, cross-entropy loss, but it's nothing but the log likelihood. Well, actually, cross-entropy does have some connection to information theory as well. Anyway, the optimizer we used is Adam — stochastic (mini-batch) gradient descent with momentum, of course. You need to use momentum, otherwise it won't work: this is a highly non-convex problem. So, on to the experimental evaluation.
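Putting the pieces together, here is a minimal sketch of one training step: product embedding in, mixture negative log likelihood out, optimized with Adam. It reuses the hypothetical `MixtureHead` from the earlier sketch; all sizes and the random stand-in data are illustrative, not the actual setup.

```python
# A minimal sketch of the training objective, reusing MixtureHead above.
import torch
import torch.nn as nn

n_products, d_emb, d_h = 10_000, 100, 32
embed = nn.Embedding(n_products, d_emb)      # dense product representation
encoder = nn.LSTM(input_size=d_emb + 2, hidden_size=d_h, batch_first=True)
head = MixtureHead(d_h)

def mixture_nll(pi, mu, var, z):
    """Negative log likelihood of sales z under the Gaussian mixture."""
    dist = torch.distributions.Normal(mu, var.sqrt())
    log_probs = dist.log_prob(z.unsqueeze(-1)) + pi.clamp_min(1e-8).log()
    return -torch.logsumexp(log_probs, dim=-1).mean()

opt = torch.optim.Adam(
    list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters()),
    lr=1e-3,
)

# One illustrative step: a batch of 64 products, 30 days each.
pid = torch.randint(0, n_products, (64,))
feats = torch.randn(64, 30, 2)               # price and offer over time
z = torch.rand(64, 30) * 10                  # observed sales
inp = torch.cat([feats, embed(pid).unsqueeze(1).expand(-1, 30, -1)], dim=-1)
h, _ = encoder(inp)
opt.zero_grad()
loss = mixture_nll(*head(h), z)
loss.backward()
opt.step()
```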
I'll go through this quickly — we don't have much time. The green line is the actual sales and the blue line is the Cubist model. Can you see from the back? I don't think so — just believe what I'm saying, then. The blue line is Cubist and the pink line is our model's prediction. As you can see, there's a sudden jump in the sales, and our model captures it really well. By the way, this is not an ideal way to judge how your model performs, because you're just looking at a few examples — these are some good scenarios I've picked so that the paper gets published. What you should actually look at is this: the summary of performance, measured with wMAPE, the weighted mean absolute percentage error (a sketch of the metric follows below). You can see that everywhere our model — NN is the neural network, the AR-MDN model — works much better than the other models, the other model here being Cubist, which was the best we found among the alternatives. And this is the category-wise performance. Again, don't try to figure out what's going on here; let me just say that our model works better. If you want, you can come up here and take a closer look. Please believe me, it's fine.

Next, I'll conclude with a few remarks and some future work. One thing we didn't take into consideration is cannibalization, which is very important for certain categories like electronic accessories, where cross-sell or up-sell is the norm. If iPhone sales suddenly go much higher, you'd expect iPhone case sales to go up too, right? Our basic assumption fails here: you can see that I'm summing over i = 1 to n, which says our data is independent across products — one i doesn't affect another i′. That's erroneous; that's not correct. But we can't really model the full dependence; we try to approximate it as much as possible. So that could be a good direction of research: how to incorporate correlation between different products. In fact, it would really help for fashion categories too, where there's a lot of cannibalization — red shirt, blue shirt, green shirt, they all cannibalize each other.

And certain categories are really hard to solve, namely fashion — again, I pick fashion. Fashion is the biggest pain point for us, and in fact for Amazon too; Amazon has a team of something like a hundred data scientists who work only on fashion. It's a very hard problem because of very sparse data and extreme cannibalization going on all the time. We don't really know what to do here yet; we have to do some research on it.

Next: how can we predict the sales of a new product? I didn't touch on this, but I guess it shouldn't be too much of a problem: for a new product, you take the product attributes as your features; that attribute vector is very sparse, so you learn some embedding, and in the embedding space you do some kind of similarity or distance computation. I'm not sure exactly how to go about it, but I don't think it's that much more difficult.

And last, the utopia — which may or may not ever happen, I don't know: leverage forecasting to automate planning, buying, pricing, everything. I think this is too hard right now. I don't see it happening in the next 10 years.
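For reference, here is a minimal sketch of the wMAPE metric used in those tables, under its common definition — the sum of absolute errors divided by the sum of actuals, so high-volume products and days carry more weight (the talk doesn't spell out the exact variant, so treat this as the assumed form).

```python
# A minimal sketch of wMAPE under the common definition; lower is better.
import numpy as np

def wmape(actual: np.ndarray, forecast: np.ndarray) -> float:
    return np.abs(actual - forecast).sum() / actual.sum()

actual = np.array([0.0, 2.0, 50.0, 3.0, 0.0, 40.0])
forecast = np.array([1.0, 2.0, 45.0, 5.0, 0.0, 44.0])
print(f"wMAPE = {wmape(actual, forecast):.3f}")
```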
Unless India becomes as mature — unless the Indian market becomes as mature as the US or European markets, where Amazon has had reasonable success: they've made something like 90% of it automated, but not everything. In India it's very wild: a lot of human intervention and manual tweaking goes on, and it doesn't really work that well. OK, so, to conclude — yes?

[Audience question] Yeah, we tried that too, but it didn't really work that well. The state-space models you're talking about — those all come under what we call dynamic linear models, DLMs. We tried that. There's a paper we found that uses Kalman filtering to estimate the parameters and all that, but it didn't work as well as this. Cubist was the best of the alternatives, as I said — I tried really hard.

[Audience question] Well, those things are not in our hands, because we can't ask Amazon, "hey, give us the prices of your products for the next 30 days." Yes, of course — those all go into the error term. That's why it's a probability distribution. The variance of the error is quite high, I'd say, but we can't do anything about it. Exactly — you can see that as you go more and more weeks into the future, the error gets higher and higher because of that.

[Audience question] Actually, that's a good question, because this is the biggest pain point in deploying this model: you need all these future inputs, and it's very hard to get them from the business, because at least in Flipkart it's all manually driven right now. The business folks don't easily give us this data. That's probably the reason why it's not deployed yet — I don't know the current situation; I did this about a year back. But yes, this model assumes you have all the future inputs — all the future attributes, the future features.

[Audience question] Yes — those are also called event embeddings, which I didn't show here. I just showed three things: price, offer, and product ID. I could have included that too.

[Audience question] To train the model in-house? At that time I used TensorFlow, but if you asked me right now, I would definitely use PyTorch.

Any other questions or comments? [Audience question] Every feature is indexed by time t, so as you vary t, your feature changes — today's price is not the same as tomorrow's price, so it's a function of t. Of course. No — this one doesn't have that. The set of features here is fixed beforehand. The problem is, if a new set of features suddenly appears, you won't have much data to train a model on it, right? Anything else? I'm around, but we don't have much time, so I'm leaving the stage for the next speaker.