about how we are applying deep learning to create SKUs — stock keeping units — to clean fashion commerce catalogs. I know that I'm standing between you and your lunch, so I'll try to keep this as interactive and interesting as I can.

First question to you folks: how many of you have ever bought an item online — tops, t-shirts, jeans, or shoes? Oh, wow, almost all of you. How many of you actually work at an online commerce portal, like Amazon, Flipkart, or Myntra? I see a fairly large number.

So I'm going to jump directly into defining the problem — why we have to create SKUs for fashion commerce — by showing you a few examples. When you sit down to search and shop, you have some intent in mind, and you type some intuitive search queries. This is one of the famous Indian fashion commerce portals, and I typed "Indian wedding dress". It is not showing me any results at all. You might ask what the reason could be. I don't know either, but one reason could be that none of the dresses are actually tagged with "wedding".

Now I type one more query: "evening party dress". And what is shown to me is a footwear item. There is a complete mismatch between what I typed and what I received. And this is not a problem with just one or two commerce portals; it is a systemic problem. I take the same search query and type it into a different portal, and here I get belts and housecoats. Most likely the search engine is working fine, because it is picking up the keywords "party" and "evening" from the text; it is just that the dresses are not tagged with either "evening" or "party".

Now you might say that I'm typing very generic search queries, so let's try some specific ones. Here I typed "men sports t-shirts", and these are the top eight results. How many of these are really sports t-shirts? I asked a fashion expert, and she said these four. Now if you look at the second item, the label is Arrow Sport. The search engine is picking up this text and showing the item as a sports t-shirt, but in fact it is a casual t-shirt, and there are more casual t-shirts in the results.

I can take an even more specific query: "blue tunic". A tunic is a very popular fashion wear for women. I see these top five results, and if I analyze them, only the first result is correct; the other four are not. In fact, I see some dresses that are green — not even blue. And the last two items are actually duplicates: the same item is being shown to me multiple times.

Now why do these problems exist? They exist because fashion commerce catalogs are not very well tagged. We saw a few examples: there were missing tags, there was a mismatch between what you typed and what you got, there were non-standard tags like casual versus sports, and there are also lots of duplicates.

So the next question is: why are these products not well tagged? There are primarily two reasons. First, the tag creation process itself is very complex. Fashion is a domain with very high dimensionality — there are lots of patterns and styles that keep changing over time — and on top of that, there is no single standard that you can use to tag these items. Second, because the commerce ecosystem is distributed, different sellers apply their own different standards when tagging items.
So let's understand these reasons some more. Why is tag creation hard? I have shown here some 15 dimensions for fashion. I don't expect you to read through all this text, but the idea is that you can imagine the number of combinations you can have with so many dimensions — for each dimension you can have, on average, 10 values. Now imagine a seller sitting and uploading one fashion item when there is no standard reference.

It is not the case that standards don't exist. We have the ISBN for books and the UPC in physical retail, and these help both consumers and businesses track items and search for them. Take the ISBN for books: to obtain an ISBN, you first have to submit the author, the name of your book, the edition, the country, and several other parameters in the appropriate form, and only after that do you get a unique number, which we call the ISBN. As you can see, because you have these attributes, it becomes very easy to index and search for books.

It is also not the case that online commerce portals don't apply unique tags or unique IDs. Amazon has the Amazon Standard Identification Number (ASIN), which is essentially an SKU. An SKU, as I mentioned, is a stock keeping unit — a product code that you assign to unique products for your internal use. Unfortunately, when fashion commerce inventory is onboarded, this process of identifying the different attributes is not followed. If it were, we would have products that are very well tagged.

Let's further understand why this problem matters from the commerce perspective. I have shown here a simple value chain from manufacturers to users. Manufacturers make fashion items, which are sold to sellers and brands. Sellers and brands then upload this inventory into commerce catalogs, which are exposed to us, the users, when we search. In direct e-commerce, the platform has very tight control over the sellers — you essentially create brands; Myntra is an example of direct e-commerce. In online marketplaces, you want to onboard more and more sellers — small, medium-sized, and large — onto your platform, but you don't have very tight control over them. For both direct e-commerce and online marketplaces, the way this inventory is onboarded is not standardized at all. No standards are maintained, because of which you see all sorts of variations in your catalog, and your catalog ends up very poorly tagged.

So why is this a problem for us? We act as an aggregator over the existing direct e-commerce portals and online marketplaces, and we want to provide a very good consumer experience. We want to provide a discovery and search experience to consumers by building very good technology, and towards building that technology we want to ensure that the catalogs we onboard are very well tagged. Because we have to onboard inventory from a variety of sources, we need a common standard that we can refer to when we tag these products. In other words, we need a magic box which can create these SKUs for us. So we actually went ahead and built this magic box.
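To make this concrete before the preview, here is a minimal sketch of what such a standardized SKU record could look like in Python. The field names and types are my illustrative assumptions; the actual standard described in this talk has about 15 dimensions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical SKU record; the real standard has ~15 dimensions.
@dataclass
class FashionSKU:
    category: str                      # e.g. "shirt", "tunic", "kurti"
    pattern: Optional[str] = None      # e.g. "checked", "striped", "solid"
    neck: Optional[str] = None         # e.g. "polo", "round", "v-neck"
    sleeve: Optional[str] = None       # e.g. "full", "half", "sleeveless"
    colors: List[str] = field(default_factory=list)
    brand: Optional[str] = None        # may be absent for unbranded items
    image_embedding: Optional[List[float]] = None  # feature vector from the image pipeline

    def key(self) -> str:
        """A human-readable product code built from the normalized attributes."""
        parts = [self.category, self.pattern, self.neck, self.sleeve,
                 "-".join(self.colors), self.brand or "no-brand"]
        return "/".join(p for p in parts if p)
```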
And I'm going to show you a preview of what happens once you have this box. I'm going to take one apparel category — women's tops — and analyze the inventory of two portals, portal S and portal M. When we analyzed their inventory and looked at how many products had a missing or incorrect category — which means tops not marked as tops — we saw that about 15% of products from portal S and about 7% from portal M had this issue. When we filtered for unique products, where there are no duplicates and the naming is consistent, we saw that only 65% of the products were well tagged for portal S, and about 86% for portal M. And when we applied one more criterion — a product counts as well tagged only if more than two attributes are well tagged, attributes like pattern, length, or sleeve — the number dropped to 45%.

So clearly we have a problem. How do we curate such fashion catalogs, and how do we create SKUs? Creating SKUs involves two steps. As I've been mentioning, we have to have a standard where we have defined these 15 dimensions, and we have to have a method by which we can identify these dimensions automatically for an incoming product and tag the product appropriately.

Here is the pipeline we have at our company for creating SKUs. We take data from different commerce portals and put it into a staging DB. Most of you know that a product page has text and images: the text has a title and a description, and you also have images from the front, back, or side. We deeply parse this text and these images, and we normalize them into the 15 dimensions, which now act as a standard for us. We also compute feature vectors for the primary photos. After doing this, we have accurate tags, so we create SKUs. And only if we find a product to be unique do we actually put it into our production DB.

For the rest of my talk, I'm going to focus on the block that does the text and image parsing. To build this block, we wanted a solution with as little feature engineering as possible, and as little reliance on HTML tag parsing as possible — because these two things change, and in fact they change pretty quickly. Being a startup, we did not have the luxury of offloading the task of getting a lot of data tagged manually. And we wanted to ensure that the machine learning models we build are properly calibrated. What I mean is: if a model is predicting something, it had better be with very high confidence.
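As a rough sketch, the end-to-end flow just described might look like this in code. Every helper name here is a placeholder I've assumed for illustration — the text and image models behind them are what the rest of the talk covers — and this is not the actual production code.

```python
# Skeleton of the SKU-creation pipeline described above. All helpers
# (parse_text, parse_images, normalize, embed, is_unique, ...) are
# hypothetical placeholders.
def ingest(product_page, production_db):
    title, description = parse_text(product_page)   # title + description
    photos = parse_images(product_page)             # front / back / side shots

    # Map the raw text and images onto the 15-dimension standard.
    tags = normalize(title, description, photos)

    # Feature vector for the primary photo, used later for dedup and similarity.
    embedding = embed(photos[0])

    sku = FashionSKU(**tags, image_embedding=embedding)
    if is_unique(sku, production_db):               # only unique products go in
        production_db.insert(sku)
```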
So what are the challenges in building such a robust solution? Say you want to identify the category from images. You will sometimes be quite surprised by the variety of photos you see on a fashion commerce portal. Here are three different products, and nowhere in the text of these products is it mentioned whether it's a single item or a pack of items. How do you process such images? Can you simply train a simple CNN classifier on top and be done? I don't think so.

Or say you want to identify attributes from images — an attribute like sleeve. Look at the amount of variety you have in the poses. You and I can easily identify that this is a full-sleeve formal shirt. But given that variety, you first have to figure out which of these images to parse, and then see whether you can really identify the sleeve type or not.

Let's take one more example of identifying a category from images. These are three different items, and we asked a few online women shoppers to mark them with a category. 60% said the first item is a top, 40% said the second item is a tunic, and 50% said the third item is a kurti. Now how many of you think that, say, the third item is a kurti? Can I have a show of hands? Actually, yeah, I can understand the reasoning. When we asked our fashion expert, she said that the first item is a tunic, the second item is a kurti, and the third item is also a kurti. The reason I'm showing you these examples is so you can see the complexity: if you have to train a machine, what kind of features, and where, should that machine focus to execute this task of identifying categories?

Let's also quickly see a few examples of text processing. Here is one product whose title contains "checked short kurti" — and if you look at the image, it is neither checked nor short. And in this example, the title contains both "tee" and "top", so you have multiple categories. Where do you bucket this item? Especially here, if you look at the photo, you don't know whether it's a tee or a top. So again, there is confusion about whether to call this product a tee or a top.

My reasoning behind showing you all these examples is to convince you that this is a really hard problem. You have to do both image processing and text processing. For image processing you have to solve all these problems: whether it's a single item or a pack, different poses, confusing categories, and more. For text: whether there is a single category or multiple categories, and whether the categories appear in the title or in the description. And just doing these two blocks in isolation is not sufficient — you have to have a confluence of the two. But if you can actually figure out a solution, there is potential for cleaning more than half of the fashion commerce inventory.

I think you had a question — sure. The question is that the product images I showed all had a white background. My answer is that there are a lot of niche designer portals where you see other kinds of images, where the background is not white but, say, the photo is taken in a park. But on most of the major fashion commerce portals, the images have a white background.

For the rest of my talk, I'm going to focus on one problem: identifying a category tag from the text and images of products. Here is a very simple and intuitive algorithm with three cases. Case one: I'm not able to figure out any category from the text, so I rely on image processing. If the confidence of my image processing model is sufficiently high, I mark the product with the category the model predicts; otherwise, I filter that product out. Case two: I have a single category in the text. If that category matches what my image processing model says, I'm done; if not, I go back to case one. Case three: there are multiple categories in the text, in the title or the description, and we have to figure out a way to rank them. Say you have three candidate categories from the title and description — you want to rank which of these is most probable. You can also take help from the image: if the image model has very high confidence, you are done; if not, you combine the top-ranked text category with the top image prediction and mark the product with the appropriate category.
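Here is a sketch of that three-case logic, assuming a hypothetical image model that returns a top-1 category with a confidence score, and the text ranker described later in the talk. The threshold value and the exact tie-breaking in case 3 are my assumptions.

```python
IMG_THRESHOLD = 0.95  # assumed cutoff for "sufficiently high" confidence

def assign_category(text_categories, img_cat, img_conf):
    """Return a category for the product, or None to filter it out."""
    if not text_categories:
        # Case 1: no category found in the text; rely on the image alone.
        return img_cat if img_conf >= IMG_THRESHOLD else None

    if len(text_categories) == 1:
        # Case 2: single text category; accept it if the image agrees,
        # otherwise fall back to case 1.
        if text_categories[0] == img_cat:
            return text_categories[0]
        return img_cat if img_conf >= IMG_THRESHOLD else None

    # Case 3: multiple text categories; rank them with the text model
    # (described later) and combine with the image prediction.
    ranked = rank_by_similarity(text_categories)  # hypothetical helper
    if img_conf >= IMG_THRESHOLD:
        return img_cat
    # One way to combine: accept only when the top-ranked text category
    # and the top image category agree.
    return ranked[0] if ranked[0] == img_cat else None
```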
So this is the pipeline we have for image processing. We first do pack detection and extract the individual item. Then we do pose detection, and then we apply a segmentation model. For the segmentation model, you give an image as input, and as output you get an array of confidence scores for each pixel — so if you have 30 categories, you get an array of 30 values per pixel. This segmentation helps us clear out the background — which goes back to your earlier question — and helps the later blocks focus on the foreground. Here, I have segmented the top wear and the bottom wear. We then take the segmented image, apply a CNN-based classifier on top of it, and at the end we have a softmax module which gives us confidence scores. The same pipeline can also be applied to extract attributes from the images.

Now I'm going to talk about these two blocks: image segmentation and image classification. Both have been receiving a lot of attention recently because of advances in deep learning. I have already explained the problem of image segmentation. In image classification, you simply take an image and you want confidence scores for the categories. We take the segmented image and give it to a convolutional neural network, and as I mentioned, we have a softmax at the end over about 30 categories.

We ran a few experiments. We gave about 2K images per category for these 30 categories and got an accuracy of about 88%. But this alone is not that good, because the average confidence — taking the probability of the topmost category and averaging it — was only about 79%. Ideally you want this number to be very high, something like 95%, because you want very high confidence when you are predicting something. So how do you solve this? We increased the number of samples per category and saw some marginal improvement in both accuracy and confidence. Then we increased the depth of the network: with 5K images per category and about eight layers, the average confidence increased from 79% to 88%. You can apply some more tricks on top — I will leave that here, and we can talk offline — but with a few simple tricks you can bump this confidence score to about 95% or so.

So now we know how to do image parsing: we have a box into which we can put an image, and we get the category of the product with a very high score.
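The classifier stage might look roughly like this. This is a minimal stand-in, not the network from the talk — the real architecture, input resolution, and layer sizes aren't specified, so everything below is an assumption except the 30-way softmax and the use of the top-1 probability as a confidence score.

```python
import torch
import torch.nn as nn

NUM_CATEGORIES = 30  # apparel categories, as in the talk

# Illustrative small CNN over the segmented image; the production
# network is deeper (~8 layers were mentioned).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, NUM_CATEGORIES),
)

def predict(segmented_image: torch.Tensor):
    """segmented_image: (3, H, W) tensor with the background cleared out."""
    logits = model(segmented_image.unsqueeze(0))
    probs = torch.softmax(logits, dim=1)   # per-category confidence scores
    conf, idx = probs.max(dim=1)           # top-1 category and its confidence
    return idx.item(), conf.item()         # compare conf against a threshold
```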
How do we solve the problem of text processing? As I mentioned, product pages have a title and a description, and we make use of this separation. We apply a simple pipeline to extract categories from the text. If we find exactly one category in the text, we are done. But if there is more than one category, we have to figure out how to rank them.

Here again we apply a very simple and intuitive technique. We have the product text on one side and a candidate category on the other; we compute feature vectors for both, and we want to know the semantic gap between the two. Let me show this with a diagram: we have the product description, the category we want to rank, two feature vectors, and at the end a softmax.

Let's go through it step by step. For the product description, I have a 2D array: the description contains words, and for each word I have a column vector, hence a 2D matrix. You can apply CNN-based techniques on top of this and get a feature vector. [Audience: Could you use word2vec?] Yes — these column vectors are taken from word2vec. You can also apply character-based embeddings; we can talk offline about that. Using word2vec again, you get a column vector for the category. Then you learn a similarity matrix between the two, whose output you feed into a feature vector that now embeds knowledge of both the category and the description. On top of that you train a fully connected layer with a softmax. The output of the softmax is whether these two entities are similar or not — it has only two classes. You can now give pairs of descriptions and categories as input and train this whole pipeline end to end.

This is a very generic pipeline: you can replace the CNN box with, say, an LSTM, or replace the softmax with, say, a contrastive-loss function. But this pipeline has worked for us so far. We gave about 12k pairs as input and got a mean average precision of about 86% and a mean reciprocal rank of 92%, which is actually quite good. With only about 5k pairs, these numbers were quite a bit lower.

So now we know how to process text, and if there are multiple categories, how to take care of them: we go back and rank the categories from the text when multiple are present, and that solves case 3.
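A sketch of that matching network in PyTorch. The embedding and filter sizes are assumptions, as is the use of a bilinear layer for the learned similarity; the structure — word2vec columns, a CNN text encoder, a learned similarity, and a two-class softmax over pairs — follows the description above.

```python
import torch
import torch.nn as nn

EMB = 100  # assumed word2vec dimension

class MatchNet(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN over the 2D matrix of word vectors -> one description vector.
        self.text_cnn = nn.Sequential(
            nn.Conv1d(EMB, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Learned similarity between description and category vectors.
        self.similarity = nn.Bilinear(128, EMB, 64)
        self.classifier = nn.Linear(64, 2)  # softmax: similar / not similar

    def forward(self, desc_vectors, cat_vector):
        # desc_vectors: (batch, EMB, num_words); cat_vector: (batch, EMB)
        text_feat = self.text_cnn(desc_vectors).squeeze(-1)   # (batch, 128)
        joint = torch.relu(self.similarity(text_feat, cat_vector))
        return self.classifier(joint)  # train with cross-entropy on pairs

# To rank candidate categories, score each (description, category) pair
# and sort by the softmax probability of the "similar" class.
```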
Now, this algorithm itself is not fully robust, so it can also make mistakes — after all, we are making decisions based on confidence scores and thresholds. So you have to do some kind of sampling analysis. This is important because, say you have normalized 2 million products: how are you going to go back and check whether your algorithm is working well? You can't possibly sit down and go through each product. So we do a very simple sampling analysis, and if this sampling block says we have sufficient confidence, we promote the models to production; otherwise, we go back, give more data or increase the depth of the models, and retrain.

Let's go through a very simple example. Say you have normalized 1 million products, and each sample is a bag of 25 products; you count how many of those 25 are correctly tagged. Your hypothesis is that the true accuracy is at least 95%. You take 50 samples and get an average accuracy of, say, 93% with a standard deviation of 6%. How do you go back to your hypothesis and decide whether your products are sufficiently well tagged? In other words, you want the probability of observing a sample average as low as 93%, given that the true accuracy is really 95%. If this probability is below 5%, the shortfall is significant, and you can expect a lot of products that are not well tagged. You can do a simple t-test, and for this particular example the probability comes out to be about 1% — meaning that if the true accuracy were really 95%, it would happen very rarely that your samples average as low as 93%. So in this example the models need more work before being promoted.
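The check above fits in a few lines. This sketch plugs in the summary statistics from the example; using scipy's t distribution for a one-sided test is my choice, not necessarily the exact test the team ran.

```python
import math
from scipy import stats

# Summary statistics from the example in the talk.
n, mean, sd, target = 50, 0.93, 0.06, 0.95

t_stat = (mean - target) / (sd / math.sqrt(n))   # ~ -2.36
p_one_sided = stats.t.cdf(t_stat, df=n - 1)      # ~ 0.01

# If the true accuracy were really 95%, a sample mean as low as 93%
# would occur only ~1% of the time, so the shortfall is significant:
# go back, add data or model depth, and retrain before promoting.
promote_to_production = p_one_sided >= 0.05
```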
So let me start concluding my talk. We did all this to create SKUs. How do you create SKUs? As I described, you have to have a standard for fashion commerce catalogs, and you have to have a method to populate the different dimensions for each product. Now, for a product like this one, you can create an SKU: the category is shirt, it has a check pattern, a polo neck, full sleeves, and red and black colors; the brand is absent, but you have a 256-bit feature vector. That is your SKU.

Given such SKUs, what happens to the numbers I showed at the beginning of my talk? That 45% went up to 84% — and by the way, this is for women's tops on one particular portal. You can also see that the algorithm filters out a few products, because we have lots of else cases: if the text or image doesn't have enough confidence, you have to filter that product, because you don't know whether it would be well tagged. But these cases are rare.

How does this help in the end? You get improved visibility of catalogs. You can improve your search results — the search results I showed at the beginning, you can expect a lot of improvement in those — and hence an improved user experience. You can also use SKUs to recommend similar products: remember, we have a feature vector, and using feature vectors you can figure out which products are similar.

To summarize: I hope I have convinced you that it is a hard problem to curate fashion commerce catalogs. I showed you different pipelines for how you can go about building SKUs and building image processing and text processing pipelines. And if you can actually build this SKU block, you can hope to see a lot of improvement in the curation of the catalogs. If you have any questions, I can take them now — the mic is here, or just speak up and I'll repeat the question.

So the first question is why we need SKUs at all. See, in a way we are now acting as a fashion commerce portal ourselves, because we act as an aggregator on top of them. So we anyway have to have unique products in our database, and those are basically the SKUs.

The next question is how the pipeline knows to segment the top wear rather than the bottom wear for a given product. So, I showed a few simplified pipelines — it's not that you discard the text altogether. The text acts as a hint about whether the upper body wear or the lower body wear is the prominent item. You can also do joint text and image embedding, have a model there, and figure this out. But most photos have either upper wear or lower wear that is clearly prominent.

One more small question: how did we come up with the training data for the image classifier — does that require manual intervention, or can it be pre-processed? So, we did this in iterations. We initially had a pool of samples that were manually tagged. We trained a model, then used that model to tag some more products. If the model had a sufficiently high score, we accepted its tags; those that didn't have a high score went back to be tagged manually. To repeat the answer: you start with a small amount of manually tagged data, train a model on top of it, and then use that model to tag more data. Say you have 100 photos that are tagged; you train a model and apply it to 1,000 photos. Not all of them will be well tagged, but you can apply some tricks and say, I don't have to look at 800 of them, only 200 — and give those 200 to be tagged manually.
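That bootstrapping loop, sketched in code. The confidence threshold, round count, and helper functions are illustrative assumptions, not the team's actual tooling.

```python
CONF_THRESHOLD = 0.95  # assumed cutoff for auto-accepting a model's tag

def bootstrap_labels(labeled_pool, unlabeled_pool, rounds=3):
    """Grow a labeled set iteratively: auto-accept confident predictions,
    send only the uncertain images back for manual tagging."""
    model = None
    for _ in range(rounds):
        model = train_classifier(labeled_pool)       # hypothetical trainer
        needs_human = []
        for image in unlabeled_pool:
            category, conf = model.predict(image)
            if conf >= CONF_THRESHOLD:
                labeled_pool.append((image, category))   # auto-accepted
            else:
                needs_human.append(image)                # e.g. 200 of 1,000
        labeled_pool.extend(manually_tag(needs_human))   # human in the loop
        unlabeled_pool = fetch_more_images()             # next batch
    return model
```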
Now, on whether a plain CNN could directly detect a single item versus a pack: I'm not making a very hard statement when I say a CNN can't do it — of course it can, and you can also apply some simple heuristics, like face detection. But that is just a heuristic: you can have a very complex image, a complex kurti with lots of patterns, and if you just count detections you don't know whether it's a single item or multiple items. You can also have saris, parts of which are free-flowing, so you don't know whether it's one sari or two.

Next question, about the segmentation: was it a detection problem, or actually a segmentation problem where you detect exact boundaries — and how were those segmentations achieved? See, the problem here is both to segment and to mark each pixel with a category, so it is a segmentation as well as a classification problem. When you have a lot of complex images, you see islands of segments — with a very complex pattern, it is not easy to get a single segment for the upper body wear or the lower body wear. It's a very hard problem to combine this local segment information into a single global output, and hence you require a CNN on top. As for how we achieved the segmentations: there is a very nice paper, PassNet, released last year, I think. You can refer to that paper; its source code is also open.

Next question: a sense of the scale of data required to get this classification to work — I mentioned 5,000 images per category, but how many categories were there? So we had 30 categories. As I mentioned on my slides, these are mainly apparel items: we had a taxonomy with 30 apparel items, and for each, as I mentioned, we had 5k images.

Next question, about the duplicate removal: after normalizing all the attributes, is it just a simple group-by on those attributes, or a separate ML classification model? So, in the SKU I had both the category attributes and the feature vector, and there are different rules you can come up with. A very simple rule is a feature-vector-based search: if it matches exactly, you are done. But what happens is that these sellers are very smart — they can crop the same photo and upload it, or change the size and upload it. For a few products we have seen 200 duplicates, with all these variations. So you have to allow some kind of margin: if the feature vectors match within that margin, then perhaps it is a duplicate. And you can also make use of the categories and attributes, because most likely a seller will reuse the same text — the other attributes stay the same; it is just the feature vector that changes because of the variation in the photo.
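A sketch of that margin-based duplicate rule, assuming the feature vectors are compared with cosine similarity — the talk doesn't specify the distance measure, and the margin value here is also an assumption.

```python
import numpy as np

COSINE_MARGIN = 0.02  # assumed tolerance for crops / resizes of the same photo

def is_duplicate(sku, candidate):
    """Attribute match plus a feature-vector match within a margin."""
    same_attrs = (sku.category == candidate.category and
                  sku.brand == candidate.brand)      # sellers reuse the text
    a = np.asarray(sku.image_embedding, dtype=float)
    b = np.asarray(candidate.image_embedding, dtype=float)
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return same_attrs and cosine >= 1.0 - COSINE_MARGIN
```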
Next question: can I explain the text classification process a little further? See, let me restate the problem I mentioned. Say you have 10 sentences in your paragraph, you apply a simple text parser, and you find two categories: top and t-shirt. You want to identify whether this text description belongs to top or to t-shirt. For the text description you compute a feature vector, and you also have feature vectors for both top and t-shirt. So now you have two pairs — the description's feature vector with the top vector, and the same description vector with the t-shirt vector — and you want to compute which pair is semantically closer.

Next question, in two parts: first, about attribute extraction — with 30 article types, some attributes are specific to certain article types, so how do we tag attributes along with article types? And second, I started with the example of wedding dresses — how do we infer whether a particular dress is a wedding dress or not? To answer the first part: it's not like you have to automate this process end to end. Upper body wear has some attributes, and those same attributes won't apply to lower body wear, so you can use such simple tricks, and when it eventually boils down to identifying an attribute in an image, you can take help from all of these. And for the second part: once you have these fine attributes, you can apply some simple rules. For example, say the category comes out to be a shirt, and it has checks or lines — most likely it is office wear. So you can encode simple rules on top and tag the product as office wear. There is no learning here — you don't need learning when it is not required.

OK, one last question — I will be available here afterwards, so you can bring me more questions, but let me take this last one. How do we handle conflict between the text category output and the image category output for the same product? So you mean the text category does not match the image category. There, we simply assume that the text category is wrong — that is our heuristic, because most of the time we have observed that the text categories are incorrect. We want to give more emphasis to the image, because the photos are what users actually see, so it makes sense to extract the category from the photos rather than from the text. Whenever text and photo don't match, we refer only to the image. I can take the rest of the questions offline. Thank you.