I'm going to talk about product size recommendation for fashion e-commerce. This was work that I did while I was at Amazon, and it was presented at WWW 2018. My collaborators on this work are Vivek Sembium, Rajeev Rastogi, and Atul Saroop.

OK. How many of you have tried to buy apparel online? Almost everyone. How many of you found it easy to pick the size? Very, very few; almost no one, in fact. It's hard for customers to choose a size online. And why is that? Sizing is often inconsistent across brands. As an example, suppose you are buying a shoe. Reebok could have a size mapping where size 6 is 15 centimeters and size 7 is 17 centimeters, while Nike could have a mapping where size 6 is 16 centimeters and size 7 is 18 centimeters. This creates confusion when you buy across brands. On top of this, many sellers do not upload a size chart, so size information is often missing or inaccurate. All of this leaves the customer picking an approximate size, and as a result they might end up returning the item. Size-related issues are one of the major reasons for returns in fashion e-commerce, so the aim of this project was to reduce size-related returns.

Let's look at an example. This is a product detail page, and you have a shoe here. Let's call this shoe the parent product. There are many size variants that you see in this dropdown; let's call each of them a child product. The problem is: for a particular parent product, can I recommend to you, as a customer, the correct child product, that is, the correct size variant? And what data do I have about you to recommend with? If you've made any purchases in the past, those give me some information about your size. That's the best data I have about your size, so that's what we want to use: not just what you purchased earlier, but also what you returned. For instance, if I bought a size 7 shoe and returned it saying it is small, maybe my foot is larger than the average size 7, or maybe size 7 of that brand runs smaller than average. If I look at a lot of transactions like this in aggregate, I can make better inferences and learn how to recommend sizes. That's the goal.

This table is another way of looking at the problem. The rows are all the customers, and the columns are all the different child products, the size variants. Some of the cells are filled with "fit", which means the customer liked the product and kept it; it was the right size. "Small" and "large" mean the customer returned the item: you typically select a return code when you return something, and here the customer selected either small or large. We have this data, and we want to fill in all the remaining cells. That's one way of looking at this problem. You might think this looks like your standard matrix completion problem, right? But the difference here is that the data is highly ordinal in nature. For instance, Nike size eight is less than size nine, which is less than size ten, and there is probably a fixed difference between them too. Traditional recommendation techniques will not capture this ordinal information. Even my labels are ordinal in nature: small, fit, and large have a natural ordering.
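To make that matrix view concrete, here is a minimal sketch in Python. The transactions are made up, and the -1/0/+1 encoding is just one illustrative way to preserve the ordering of the labels:

```python
import pandas as pd

# Hypothetical transaction log: one row per (customer, child product)
# purchase, with the fitment outcome ("fit" = kept; "small"/"large" =
# returned with that return code).
transactions = pd.DataFrame(
    [
        ("c1", "nike_8", "fit"),
        ("c1", "reebok_9", "small"),
        ("c2", "nike_8", "large"),
    ],
    columns=["customer", "child_product", "fitment"],
)

# The labels are ordinal: on the s_i - t_j axis, large < fit < small,
# so encode them as -1 / 0 / +1 rather than as unordered classes.
ORDINAL = {"large": -1, "fit": 0, "small": +1}
transactions["y"] = transactions["fitment"].map(ORDINAL)

# Pivot into the (very sparse) customer x size-variant matrix from the
# slide; the NaN cells are the ones we want to fill in.
matrix = transactions.pivot(index="customer", columns="child_product", values="y")
print(matrix)
```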
So again, we want to leverage customer purchase and return data, and the approach we take is to find the true sizes of customers and products and use those to make recommendations. For example, if I know a true size for the customer and a true size for every product, then given a parent product, I would pick the child product whose true size is closest to the customer's true size. That's the intuition.

But how do we do it? Before we get to how, let's look at why this is a hard problem. Cold start: this is true of any recommender system; most customers and products do not have any transactions. Sparsity: most customers and products that do have transactions have very few of them (I'll show more on this later). Multiple personas: I might buy a size eight for myself and a size four for my daughter, and that would definitely confuse a recommendation algorithm. Noisy return data: customers do not always indicate the correct return code. When somebody is returning an item, what motivation do they have to pick the correct code? Maybe I saw a lower price somewhere, or I just don't like the item, but I might still say it is small. That happens a lot. Also, customers do not always return not-so-great fits: if I buy a T-shirt and see that it's slightly large, one size up, I might just keep it. A lot of people do that, and it makes the data we learn from very noisy.

So how do we approach this problem? Let's set up some notation, not a lot. Let i be a customer and j a child product, or size variant. Let s_i be the true size of the customer and t_j the true size of the child product. By true size I do not mean the physical size; what I'm talking about is a normalized catalog size. If there is one axis onto which I can project a lot of different things, I'm calling that axis the true size. My data is a large set of transactions. Each transaction has a customer and a child product, an i and a j, and a y_ij, which is a fitment code: fit, small, or large, depending on whether the customer kept the item or returned it saying small or large. This is the data I have.

Now, the intuition: suppose a transaction is a small transaction. Then I would expect s_i to be greater than t_j; the customer's size should have been greater than the product's size if the customer returned it saying small. In other words, s_i - t_j is greater than some threshold b_1. Similarly, for a large transaction, I would expect the customer's size to be smaller than the product's size, so s_i - t_j is less than some b_2. For fit transactions, we would expect s_i and t_j to be very close to each other, and hence s_i - t_j lies between the thresholds b_2 and b_1. So I know these constraints need to be satisfied, and the problem is to find the values of s_i and t_j that best satisfy them.
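As a concrete sketch of these constraints and the resulting recommendation rule (the threshold values and helper names here are illustrative, not from the paper):

```python
B1, B2 = 0.5, -0.5  # illustrative thresholds, with b_2 < b_1

def consistent(s_i, t_j, fitment, b1=B1, b2=B2):
    """Does the pair of true sizes (s_i, t_j) satisfy the constraint
    implied by one transaction's fitment code?"""
    d = s_i - t_j
    if fitment == "small":      # customer larger than product
        return d > b1
    if fitment == "large":      # customer smaller than product
        return d < b2
    return b2 <= d <= b1        # fit: sizes close to each other

def recommend(s_i, child_true_sizes):
    """Given the customer's true size and the true sizes of a parent
    product's size variants, pick the variant closest in true size."""
    return min(child_true_sizes, key=lambda j: abs(s_i - child_true_sizes[j]))

# recommend(7.2, {"size_7": 7.1, "size_8": 8.0})  ->  "size_7"
```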
You can see the same thing pictorially at the bottom, with s_i - t_j plotted on the x-axis. All the fit points are clustered around zero, because s_i - t_j should be close to zero for fits; all the smalls are on the right side, and all the larges are on the left side. This is the same intuition you saw earlier.

OK, so now the question is: how do I find the s_i and t_j that best satisfy these constraints? We tried two approaches. The first is what we data scientists do first when we have a bunch of constraints: create a loss function for them and solve it using gradient descent. The second is a Bayesian approach. I'll cover the first one very briefly and go into a little more detail on the second.

This is the loss function approach. You saw four constraints earlier, and there is one term for each of them. For instance, we said s_i should be greater than t_j for a small transaction; so if it's a small transaction and s_i is not sufficiently greater than t_j, we add a penalty. That's how we build the loss function, and we use a hinge loss here. Essentially we have four terms: one for large, one for small, and two for fit, because fit had two inequalities. We solve this loss function using gradient descent. This approach was published at RecSys 2017; you can take a look at that paper if you want more details.
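Here is a minimal sketch of what those four hinge terms and a subgradient update could look like. The thresholds, learning rate, and update scheme are illustrative; the exact objective in the RecSys 2017 paper differs in its details:

```python
B1, B2 = 0.5, -0.5  # same illustrative thresholds as before

def hinge_loss(s, t, transactions):
    """One hinge term per inequality: small and large contribute one
    term each, fit contributes two. s and t map ids to current sizes;
    transactions is a list of (i, j, y), y in {"small", "fit", "large"}."""
    loss = 0.0
    for i, j, y in transactions:
        d = s[i] - t[j]
        if y == "small":                 # want d > B1
            loss += max(0.0, B1 - d)
        elif y == "large":               # want d < B2
            loss += max(0.0, d - B2)
        else:                            # fit: want B2 <= d <= B1
            loss += max(0.0, d - B1) + max(0.0, B2 - d)
    return loss

def sgd_epoch(s, t, transactions, lr=0.01):
    """One subgradient-descent pass; g is d(loss)/d(d) per transaction."""
    for i, j, y in transactions:
        d = s[i] - t[j]
        g = 0.0
        if y == "small" and d < B1:
            g = -1.0                     # push d up
        elif y == "large" and d > B2:
            g = +1.0                     # push d down
        elif y == "fit":
            g = (1.0 if d > B1 else 0.0) - (1.0 if d < B2 else 0.0)
        s[i] -= lr * g                   # since d(d)/d(s_i) = +1
        t[j] += lr * g                   # since d(d)/d(t_j) = -1
```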
The problem with this approach, though, is that it produces a point estimate: when we solve the loss function, we get a single value for each s_i and t_j. A point estimate is not very good for this problem, for a few reasons. First, the data is very sparse. At Amazon scale there are tons of transactions, several tens of millions even for a very small subcategory, but if I want to make an inference about one particular customer's size or one particular child product's size, there are very few transactions. That's what I mean by sparsity: very little data to infer any one point. On top of that, the data is very noisy, and we saw why. Put these together and a point estimate is risky, because a wrong recommendation could actually increase the number of returns instead of decreasing them. You don't want to make a recommendation at all if you are not confident about a particular inference.

That motivates a Bayesian approach: instead of finding a point estimate, we want to find a distribution over the unknowns, in our case the s_i's and t_j's, the customer and product sizes. That's how Bayesian methods work, and you saw some of this in this morning's talk on uncertainty in AI. Essentially, we have a prior, which is what we think the distribution is to begin with. For instance, I might think my product's size follows a Gaussian distribution centered around the catalog size; that's a natural starting point. Then, as I see more data, I want to refine this distribution and hone it into something more reflective of the data. The black curve is what I might see after a lot of data, and if that distribution is sharply peaked, I can make a more confident estimate from it. Once I have a distribution, I can take, say, its mean as the value of the variable I'm trying to find, in this case the customer size or the product size. That's the basic idea, and we built a model around it.

The intuition is pretty similar to before, but now we have a probabilistic formulation of the same problem. Suppose we fix t_j, the product size, and look at the distribution over s_i. For a fit transaction, s_i should be very close to t_j, so the probability mass for s_i peaks around t_j; that's what you see. Similarly, for a small transaction, s_i will be larger than t_j, so the mass is all to the right; and for a large transaction, s_i is smaller than t_j, so the mass is all to the left. The transitions between these regimes are not sudden zero-one transitions but smooth ones, and a smooth transition from zero to one can be realized with a logistic function, which you have probably all seen before. So these are smooth transitions modeled with logistic functions.

Once you have this intuition, the way a Bayesian model works (a Bayesian network, in our case) is that you write down a generative process: how you would generate the random variables and the data according to this intuition. That defines your mathematical model, and inference then runs backwards: given the data, find the values of the unknown variables. So there are two steps. The first is to define the forward, generative, process. Here I first generate my random variables s_i and t_j from Gaussians; this is nothing but my prior, what I think I know about them. In this case the product's Gaussian has the catalog size as its mean, and the customer's Gaussian has as its mean the average size over the customer's fit transactions. Then I define how the data is generated: given a customer size and a product size, how do I determine small, fit, or large? For that I take the intuition from earlier and define, at every value of s_i - t_j, probabilities of small, fit, and large; at any point these three probabilities sum to one, and they are built from a combination of four logistic functions. Let's not get into the exact details of how this is defined, but that's the idea.
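Here is a minimal sketch of such a generative process under the assumptions above. For simplicity I combine only two logistic functions here, where the model in the paper uses four, and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fitment_probs(d, b1=0.5, b2=-0.5, a=10.0):
    """P(small), P(fit), P(large) as smooth functions of d = s_i - t_j.
    Two logistic transitions at the thresholds; at every d the three
    probabilities sum to one. 'a' controls the sharpness."""
    p_small = sigmoid(a * (d - b1))    # ramps up as d rises past b1
    p_large = sigmoid(a * (b2 - d))    # ramps up as d falls below b2
    p_fit = 1.0 - p_small - p_large    # peaks around d = 0
    return p_small, p_fit, p_large

def generate_transaction(catalog_size_j, avg_fit_size_i, sigma=0.3):
    """Forward process: draw true sizes from the Gaussian priors,
    then draw the fitment code from the logistic model."""
    t_j = rng.normal(catalog_size_j, sigma)   # prior mean: catalog size
    s_i = rng.normal(avg_fit_size_i, sigma)   # prior mean: avg kept size
    p_small, p_fit, p_large = fitment_probs(s_i - t_j)
    y = rng.choice(["small", "fit", "large"], p=[p_small, p_fit, p_large])
    return s_i, t_j, y
```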
So now the idea is to find the posterior distribution: what my prior evolves into after looking at the data. Doing this exactly is not easy; it often does not work out mathematically to derive the exact form of the posterior. So people do something called approximate inference, and common forms are variational inference and Gibbs sampling. You actually saw variational inference quite a bit this morning, and it's pretty much the same technique here: we used variational inference. I'm not giving the details of the inference; you can find them in the paper. But we worked out the math and came up with a simple algorithm with closed-form updates. We coded it up in Python; the memory requirement was just a few GB, the runtime was quick, and the machine configuration is given here.

And we saw that this was doing pretty well. To give you the scale of the data: as I said, there are tens of millions of transactions in each subcategory, and A, B, C, D, E, F here are different subcategories. As for baselines, the first thing to try is a simple classification technique: take some features, say the catalog size, the average size purchased by the customer, and any other features you want, and train a classifier to predict small, fit, or large. That's the simplest thing you could do here, so we tried it, of course, with a couple of classifiers: a random forest classifier, a logistic regression classifier, and so on. We also tried the loss function approach and then the Bayesian approach, and the Bayesian approach performed better in terms of AUC. The numbers here are shown as ratios over the worst-performing technique.
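As a rough sketch of that simplest baseline (the feature set and data here are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature rows: [catalog size of the child product,
# average size the customer bought and kept]. Labels are fitment codes.
X = np.array([[8.0, 7.5],   # product runs bigger than customer -> "large"
              [9.0, 9.0],   # sizes match                       -> "fit"
              [7.0, 8.0]])  # product smaller than customer     -> "small"
y = np.array(["large", "fit", "small"])

# Baseline: treat size recommendation as plain three-way classification,
# ignoring the ordinal structure that the other approaches exploit.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[8.5, 8.0]]))
```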
So this was the WWW 2018 work. To summarize: we addressed the size recommendation problem by leveraging purchase and return data. One thing to remember is that in our situation it made sense to use a Bayesian approach because of the sparsity and the noise in the data; instead of finding point estimates, we found distributions. And finally, we derived a variational inference algorithm, hand-computed all the closed-form updates, and coded them up, and that turned out to be very fast and efficient for us.

Before finishing up the talk, I want to add that the size recommendation problem is actually very tough to crack. While we did go Bayesian to get confidence estimates, the fact remains that the data is very noisy, and we might not be able to make recommendations in a lot of cases. I think more brains are needed on this problem; we've just taken the first step. And the next time you buy a shoe, hopefully it will be the right fit for you, once this is launched in India. Any questions?

Yeah, maybe one question. Hi, I'm Amit, thanks for the talk. I'm taking myself as an example user of a website, and I'm not typical because I don't buy too many things online. Generally, when I go to a site, I look for the size, and suppose I buy a product and the size doesn't fit me. That leaves a very negative taste in my mind, and I'm probably not going back to that site anytime soon; maybe I'll go back after a year or whatever. Again, I'm not a typical customer. In that case, you don't get any more data out of me, right? And my feet might be really unusual. So how does somebody else buying and returning something help a specific customer who is not giving out much data?

OK, there are a couple of things here. The first is: suppose I don't have any data about you. That's exactly why we wanted this Bayesian approach, so that I don't make a recommendation if I don't think I can make a good recommendation for you. The second part of your question is how somebody buying something in a different brand helps. Suppose somebody bought Nike size seven and Reebok size eight and kept both of them. Then you know that Nike size seven is maybe similar to Reebok size eight. Across different sites, though, we can't do that right now; we're not handling that. Maybe there's a way to get that data too, but it's not being done right now. So the answer is that if you're not confident, it's better not to make a recommendation. That's where it is.