Okay, let's start. Thank you everybody for being here, and welcome to this talk. We are talking about product matching in the fashion industry. I am Alicia Perez, he's Javier Ordoña, and we are both data scientists at StyleSage. Let me introduce a little bit what StyleSage is and why we are here talking about fashion.

StyleSage is an AI-powered analytics tool focused on brands and retailers. The platform helps them make strategic decisions based on artificial intelligence, mainly through four types of analysis. The first is pricing: we help them see at which prices their competitors are selling, and who their competitors really are. Another analysis is the assortment: we help them with the prices, the sizes and which products are available or not, again to make strategic decisions. We help them with the promotions, the marketing activities that their competitors are running, and we also help them with the trends, the influencers and all the social media that is now so relevant for the fashion industry.

As data scientists, Javier's and my role in the company is mainly to work with the data, enrich it and try to generate new information. Luckily for us, StyleSage has been around for more than four years and we have been collecting data since the beginning, so we have a huge database with many products. We have more than 400,000 million products, in different languages, across different countries, with texts and images, and the database grows by one million products each week. We take this information and apply different machine learning strategies and pipelines, such as image recognition, natural language processing and time series, a big toolbox of data science techniques, and with this we build the platform that we sell to our customers so they can make all the decisions I mentioned before.
But here we are talking about product matching, and let me introduce the concept with a Bill Gates prediction. In 1999 Bill Gates put some thoughts in a book called Business at the Speed of Thought. In this book he thought about the future of computer science and technology in general, and looking back from today it's extremely surprising how accurate his predictions were. For example, he introduced some things that are really normal for us now but were totally undiscovered back in 1999, like the idea of the smartphone, social media such as Facebook, or travel websites. And for me it's extremely surprising that the number one prediction was this one. Let me read it to use exactly the same words; it was something like: automated price comparison services will be developed, allowing people to see prices across websites, making it effortless to find the cheapest product for all industries. This was in 1999, and it is closely related to our product matching.

Product matching is something like this: you have the same product on two different websites, this is a real example, and you need to know that it's exactly the same product. It seems easy because we as humans have the superpower to infer features, to infer characteristics, mainly from images. So for me, based only on the image, I would say that this is the same product. And why?
Okay, because for me the pattern is more or less the same. I don't know if you can see it clearly in the image, but it's a jeans pattern with dots. And I can imagine where the hips and the knees of the model are, so I can say that the skirt length is more or less the same. And I would say that the shape is more or less the same, because I can imagine that it fits at the hips and is straight in the lower part. But for an algorithm based only on the raw image it's not so easy, because the color is actually not the same, the hexadecimal code is probably not the same, the shape in the image is not the same, and I cannot clearly see the whole body in the image; I can only imagine it. So yeah, it's not so easy. And if I try to help myself with the text, it's not very helpful, because the only matching text in these two products is s.Oliver, which is actually the brand. It's not related to the skirt itself, it says nothing about the skirt, and the rest of the text is in different languages and is not very descriptive of the skirt. But why is it important?
Yeah, the first idea that we have in mind is this: as a user, as a customer of the web page, I want to know on which website it is cheaper in order to buy it, and we can see that they have different discounts and different prices even though the original price is the same. But at StyleSage we work with brands and retailers, and for them this is really important not only at this moment: they want to know the price strategy over time. These are charts from our platform, and you can see that it's not only today's price that is different; the price strategy is actually different on these two websites. On the first one this product has been more or less discounted over time, with no very big discount, but on the second one it was at full price almost all the time and in the last weeks they have followed an aggressive price strategy. So they need to know this in order to be able to react.

Another goal could be the assortment, another of our analyses. You can see that this product is not available in all the sizes: on the first website we only have two sizes available, with very few units, and on the second one we have all the smaller sizes available. So yeah, the availability is different.

And the third one is related not to the competitors but to inventory organization. For example, if you think of very big retailers such as Walmart, Target or Asos, they have a huge inventory with many products supplied by different brands, by different providers, and each brand is probably providing the data in different formats, with different fields, with different ways of serving the information. They need to know whether a product is already in their inventory before including it, because if not you will find the same product on the same website with different information and a different price, and this will be confusing. So yeah, there are a lot of benefits of product matching, and maybe you're thinking: okay, but this is a problem
that is common to all e-commerce businesses, and it's more or less solved with universal codes. And yeah, in technology, for example, this is very common, but in fashion it's not so common. It's very hard to find these codes on the websites. Let me show an example. This is a TV. This is exactly the information that I can find for a TV on the website of the brand that makes it, and the first information that I find is the code. If I search for this code in Google, I get this product on many websites where I can buy it, and even in Google Shopping. But in fashion it's not so easy. They try to hide these codes in order to avoid this type of analysis by their competitors.

And one warning before continuing: we are talking about product matching, not about product similarity. Product similarity is another thing. It tries to find, based on one target product, products with similar features; it's a ranking by similarity. We are not doing this. It is a very useful use case for all of e-commerce and especially for fashion, but you need another approach for the solution and you get other benefits, and we are not talking about it. We are trying to find exactly the same product. Okay.

So yeah, let's start with the solution. I will start with the first step, which is to prepare the data to be able to compare. The solution will be based on taking two products and trying to figure out whether they are the same. And if I want to compare things, I need them to be comparable. I need to compare apples with apples. What am I talking about? As I said, the information on the websites is unstructured: they have different information available in different formats. For example, this is a product in Zara China, this is a product from a retailer in Europe, and for example here you have very little text, as you can see in these three examples.
The images are different too: for some products you have many images, for others you have only the product, in others you have the model. So yeah, we need to structure this information in order to be able to compare.

We can start with an image-based classifier. I can take all the images and try to define a standard taxonomy. I can define with the business team what is a dress, what is a shoe, what is a jacket, and I can try to put all these images into these categories. I can try to classify the products based on the image. For this we are using deep neural networks. I know that we have been talking about this topic a lot in this conference, so I will only say that deep neural networks are pieces of code organized the way we think our brain works. They are organized in layers, and we have small pieces that we call neurons, which are actually nonlinear functions. They are connected to each other, and the training process makes these connections stronger if they lead to success in solving our problem and weaker if they do not. In the end we have a probability distribution, and we can use these probabilities to classify our images and classify our products.
So let's say I have this image: I can classify it in the base category and say that this is a dress. But I can use this approach for different taxonomies in order to have more information to find the match. I could say that the color is red, the pattern is floral, the sleeve length, the neckline, whether it has ruffles or not. I mean, you can structure this information, and you can do this with this image and with the other images that we showed before, so we have the same attributes for all of them and now we can compare. For example, if I try to find in a very naive way whether these two images are a match, I can use this approach, extract all the attributes that I need, and in the end, if they have the same attributes, maybe I could say that this is actually a match.

But it's not so easy. Why? Because in fashion sometimes you have the model with the whole outfit, and based only on the image I'm not able to say whether we are talking about the jacket, the pants or the shoes. And if I'm not able to say that, probably my algorithm will not be able to either. Sometimes it is confusing even for us: again, I'm not able to say whether this is a pant, a natural cotton pant or a pyjama, so it will probably be hard. And sometimes we have detail images that do not provide much information about the category of the product.

So maybe I could help myself with the text. Again, the text is different, with different languages and different information available. So maybe I could use a similar approach: I could use a neural network, I could split the text into tokens, and I could try to figure out whether some words are really relevant for some of the categories that
I'm trying to find. But again, it's not so easy, because sometimes I have ambiguous descriptions, for example a winter jacket. But what kind of jacket? A puffer jacket, a bomber jacket, a blazer? It's not clear. In the fashion industry there are many ambiguous terms, and there are different spellings for the same thing; in materials and in sizes, for example, this is very common. Sometimes we have 2XL and sometimes we have XXL. So this is something that we need to take into account in our classifiers. And yeah, missing data and inconsistent data are very common here. But if you take into account all these difficulties that you can find in your classifiers, you can do it, and we are doing it.

Okay, so you have on one side the product on the website and on the other side the structured product as we have it in the platform. For all the products we have a title, we have the brand, we have the price, we have the available colors. I mean, we structure this data for all the products and we have the same information for each. Well, sometimes you have missing information in some fields, but you have all the information that you need and now you can compare. Okay, so for now we are not solving the problem; we are only preparing the data in order to compare, and Javier will explain how.

So now that we have the products in a standard taxonomy and we know how to extract this information, we can actually start comparing the products. Okay, so what is product matching from an algorithmic point of view? The first thing you have to see is what an instance, or an example, is for our product matching model, in terms of a machine learning model. So this is not a good example. As Alicia explained, we are not dealing with a product similarity problem here; we are dealing with a product matching problem, so we need exactly the same product. This example here is a very similar dress. It's not the same dress.
It's an excellent example for a product similarity problem, but it's not a good example for a product matching problem. We need this: exactly the same dress in two different catalogs. Okay, so that means that in terms of machine learning this is a supervised binary classification task. It's supervised because for every pair of products we will have a single label, and it's binary because this label is either positive or negative. Two products are either a match or not; there is no point in between.

So now that we have these products in the standard taxonomy, with the same dimensions or characteristics, we can actually start comparing them. We can compute an individual similarity score for each one of these dimensions. For the products we have the title, extracted in the same way, we have the color, we have the material, we have the attributes, and we can start working with that.

The first thing we can do is to compare the title, the text. As Alicia explained, we can transform the text into some kind of vector representation. This is a bit out of the scope of this talk, but the idea is that we can use TF-IDF, or a binary encoding, or a dictionary, or maybe a neural network to transform the text into a vector. Once we have the vectors we can use distance functions to compare them, like the dot product, the Euclidean, the Manhattan or the cosine distance. We can actually combine these different distances with the different ways we have to compute the vector, and that way we can get a bunch of different similarity scores just for the title.

The next thing that we can compare is the colors, and here we need a first, preliminary step, which is: how can we get the color?
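As a brief aside on the title comparison just described, here is a minimal sketch of one of the combinations mentioned, TF-IDF vectors compared with the cosine. It uses only the Python standard library; the tokenizer, the smoothed IDF formula, the function names and the example titles are all assumptions made for illustration, not the speakers' implementation.

```python
import math
from collections import Counter


def tokenize(title):
    # Assumption: lowercase whitespace tokenization; a real pipeline would do more.
    return title.lower().split()


def tfidf_vectors(titles):
    """Return one sparse TF-IDF vector (dict: token -> weight) per title."""
    counts = [Counter(tokenize(t)) for t in titles]
    n_docs = len(counts)
    # Document frequency: in how many titles each token appears.
    doc_freq = Counter(token for c in counts for token in c)
    # Smoothed inverse document frequency, so no token gets a zero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in c.items()} for c in counts]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical titles: the same skirt on two shops, plus an identical title.
titles = [
    "s.Oliver denim skirt with dots",
    "s.Oliver Jeansrock mit Punkten",
    "s.Oliver denim skirt with dots",
]
vectors = tfidf_vectors(titles)
title_score_ab = cosine_similarity(vectors[0], vectors[1])  # only the brand overlaps
title_score_ac = cosine_similarity(vectors[0], vectors[2])  # identical titles
```

Swapping `cosine_similarity` for another distance, or `tfidf_vectors` for another encoding, gives the other title scores in the same way.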
So either we can get the color directly from the image, taking the color density, and that way we have a hexadecimal code representing the color, or we can go to the title or the text and check the name of the color. Usually what we do here is to apply some kind of normalization process first, because sometimes we have really strange color names, like midnight blue, which means dark blue. Once we have this color normalized, we can apply a dictionary and transform the color name into a hexadecimal code. So we can extract one or several colors per product. And once we have this hexadecimal code, then comes the easy part, and I say easy because the distance between colors has already been solved by the International Commission on Illumination. They have defined some formulas which are based on how humans perceive the distance between colors, and we can apply these formulas to see what the distance is between the one or several colors that we may have extracted from the products, and we can also get a bunch of similarity scores for the color.

The next thing that we can compare is the material. For us these are discrete values, so what we do is to define a positive or negative distance for the material. In our case we divide the materials into clusters or families. Our idea is that if two products have a material in common, or from the same family, that is a positive distance; if one of them has a missing value, then that is a negative distance. You can also combine this logic with the percentage of each material, so you have a kind of natural way to define the distance between the materials. That way you can get a similarity score for the materials of the two products.

The next thing we compare is the attributes. As Alicia explained, we can get the attributes from the title, from the text, from different fields, and the idea is that you can get a bunch of attributes for a single product. But this is not always so easy, because, I mean,
sometimes the attributes are not perfectly extracted, and you can have situations like this, in which, because of the image, the algorithm is maybe not sure about the length of the skirt, or because of the position of the arms the algorithm is not sure about the length of the sleeve. So you may have attributes which have not been extracted correctly, and you may find this situation in which you have two products which in principle are a match but have some attributes which are not matching. What we do in this case is to define a positive distance for those which are matching and a negative distance for those which are not. This can also be combined with some kind of natural heuristic that we may have for the attributes. I mean, for us as humans it's kind of intuitive that a dress with long sleeves is closer to a dress with short sleeves than to a dress with no sleeves. So we can take advantage of this to define the distance between these attributes, and also combine that with the confidence that we obtain from the algorithm. That way we have a way to compute similarity scores for the attributes, okay?

And the last dimension that we can compare is the images. This is a very complex problem, which is a bit out of the scope of this talk, but what we can say is that one of the solutions to deal with it is Siamese neural networks, which are a specific type of neural net. Two talks ago they talked about how to transform images into embeddings to compute the distance. Basically we are doing the same thing: in a Siamese neural net, instead of having just one image as input, you have two images. These images are processed in parallel, then you get an embedding for each one, and then you have a set of additional layers to compute the distance between those embeddings.
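Going back for a moment to the color comparison described earlier: a minimal sketch of one of the CIE formulas, the CIE76 color difference, which is simply the Euclidean distance between two colors in CIELAB space. The talk does not say which CIE formula is used, so CIE76 is an assumption here (later formulas such as CIEDE2000 are more elaborate). The small name-to-hex dictionary is also hypothetical; its two entries use the standard CSS values for those color names.

```python
import math

# Hypothetical normalization dictionary: color name -> hexadecimal code.
COLOR_NAMES = {"midnight blue": "#191970", "navy": "#000080"}


def hex_to_lab(hex_code):
    """Convert an sRGB hex code to CIELAB (D65 white point)."""
    h = hex_code.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) / 255.0 for i in (0, 2, 4))

    def linear(c):
        # Undo the sRGB gamma encoding.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = linear(r), linear(g), linear(b)
    # Linear sRGB -> CIE XYZ.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    xn, yn, zn = 0.95047, 1.0, 1.08883  # D65 reference white

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(hex_a, hex_b):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(hex_to_lab(hex_a), hex_to_lab(hex_b))


# Midnight blue should be perceptually closer to navy than to pure red.
blue_vs_navy = delta_e76(COLOR_NAMES["midnight blue"], COLOR_NAMES["navy"])
blue_vs_red = delta_e76(COLOR_NAMES["midnight blue"], "#ff0000")
```

A distance like this can be turned into a similarity score in several ways, for example by taking the minimum distance over all color pairs extracted from the two products.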
So the result of this whole process is a score, a number, which is a similarity score, okay? This algorithm has to be trained as an independent machine learning model, so you need to define your training data, train the model, evaluate it independently, and so on and so forth, and at the end you have a system which is able to compare images and give you a similarity score.

So when you have the similarity scores for all these dimensions, you can combine them all together, concatenating them into what we call a similarity vector, and the similarity vector is a pretty good approximation of how close two products may be. Since we are dealing with a supervised problem, what you have to do is to associate a label with this vector. In the case in which we have a match, the label will be positive; in the case in which we do not have a match, the label will be negative, okay?

But this is the perfect case. The case we have just seen is the one in which the product has been perfectly extracted and we have all the dimensions, but this can also happen: we may have products where the image is not present on the website, maybe the color cannot be extracted, the material is also missing, or the attributes are quite poor. In those cases the similarity vector will look like this: we will have a lot of missing values, some gaps, and our system should be ready to deal with this kind of problem.

Okay, so which kind of algorithm do we use to deal with this? In our case we use gradient boosted trees, which is a pretty standard algorithm in the industry. This algorithm is based on trees, which are basically like a bunch of rules which relate the feature vector to the label.
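The similarity vector with gaps can be sketched as follows. This is only an illustration under assumptions: the dimension names, their order and the example scores are hypothetical, and missing dimensions are encoded as NaN, which is a convention that gradient boosting libraries such as XGBoost can treat as a missing value.

```python
import math

# Assumed fixed order of the per-dimension scores inside the vector.
DIMENSIONS = ("title", "color", "material", "attributes", "image")


def similarity_vector(scores):
    """Concatenate per-dimension scores; dimensions without a score become NaN."""
    vector = []
    for dim in DIMENSIONS:
        value = scores.get(dim)
        vector.append(math.nan if value is None else float(value))
    return vector


# Hypothetical pair with every dimension extracted, labeled as a match (1).
complete = (similarity_vector(
    {"title": 0.82, "color": 0.9, "material": 1.0, "attributes": 0.6, "image": 0.95}
), 1)

# Hypothetical pair with no image and no material, labeled as a non-match (0).
with_gaps = (similarity_vector(
    {"title": 0.31, "color": 0.4, "material": None, "attributes": 0.2}
), 0)

# Each (vector, label) tuple is one training example for the binary classifier.
training_set = [complete, with_gaps]
```

Keeping the gap explicit, rather than filling it with zero, lets the classifier distinguish "no information" from "completely dissimilar".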
Okay, and this algorithm will create several independent trees, small trees called weak learners, which relate the features to the label for specific cases. These trees are combined into a meta-tree or big tree, which is called an ensemble model. This model is a pretty good approach for this problem because it's very robust and can deal pretty well with missing features, with the gaps. It also deals pretty well with unbalanced data, and product matching is a very unbalanced problem by nature, because it's much, much easier to find negative cases than positive cases. It's really hard to find matches to train your model, and this algorithm can deal pretty well with that. In general it is also a good first approach for any machine learning problem; you will see that on Kaggle it is a very popular method. In our case we are using the implementation called XGBoost, okay?

So now that we have the feature vector and the model, we have to think about how to train our model, because, as I said, it is very difficult to find real matches. So what do we do? We apply a semi-supervised approach as an intermediate step. How does this work? We take a product with several images, as we can find on a website, we take some of these images from the same product, and we train an image-based similarity algorithm. As I said, we're not trying to solve a product similarity problem, but in this case, as an intermediate step, it is a good approach to get your model up and running in a cheap way. So once we have these images we can train an image-based similarity algorithm, and we get something like this: we have some products which are similar.
Maybe they are not the same, but they are close enough to be considered pseudo-matches. From those we choose the ones which are pretty close to being a match, and those will define the positive instances for our model. As I said, this is not going to be the final data we use to train our model, but it is good enough to have a first iteration of our model. Okay.

So now that we have the data, we can start thinking about how to evaluate this model. Keep in mind the following: as I said, this problem is very unbalanced by nature, and this is what I mean. Take this case, in which you have two catalogs, in this case of dresses, with a 30% overlap. That means that we have three dresses belonging to both catalogs. If you try to compute how many different pairwise combinations we may have, we will have three pairs which are matches and 87 which are not. This is what I mean by a very unbalanced problem.

So if we transfer this into a set of pairs where three of them are matches and 12 of them are not, and we try to solve this using a very bad, very useless model which says that everything is not a match, that model will predict correctly 12 out of the 15. If we measure this using the accuracy, the accuracy of that model will be 80%. That means that accuracy is not a proper metric for this problem; there are much better metrics for this. What we need is a model saying that this subset are matches. From there we can check how many of those are actually matches, and that is the precision, and we can also check, from the actual matches, how many we detected and how many were left out, and that is the recall. So precision and recall are much better metrics to measure the performance of our model than the accuracy, which is useless in our case.
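The 15-pair example can be checked with a few lines of Python. The always-negative baseline is the one from the talk; the second set of predictions is a hypothetical model added only for contrast, and returning 0.0 when precision is undefined (no predicted matches) is a convention chosen for this sketch.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = match, 0 = no match)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Convention for this sketch: 0.0 when the ratio is undefined.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# 15 pairs: 3 real matches followed by 12 non-matches.
y_true = [1, 1, 1] + [0] * 12

# Useless model from the talk: everything is "not a match".
baseline = evaluate(y_true, [0] * 15)

# Hypothetical model: flags 4 pairs, 2 of which are real matches.
candidate = evaluate(y_true, [1, 1, 0, 1, 1] + [0] * 10)
```

Both models get 12 of the 15 pairs right, so their accuracy is identical, yet the baseline finds no matches at all while the hypothetical model finds two of the three. Only precision and recall show that difference.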
Okay, so now that we have the model, we can start thinking about deploying it. The first thing we have to keep in mind: just consider the following situation. These numbers come from a real matching process we are running in our company right now. On one hand we have a retailer with 850,000 products; on the other hand we have a retailer with 380,000 products. If you try to compute whether every possible pair is a match or not, that is, all the pairwise combinations, you get that number. So if your model computes whether a pair is a match or not in a tenth of a second, that will take 1,000 years. This is not feasible; it's a problem which is not tractable. You have to do something about this.

Okay, so what do we do? Going back to the earlier problem with the dresses, what we do is to match only those products which have the brand, the gender and the category in common. That way you can greatly reduce the dimensionality of the problem, and you have something like this: small clusters of products which have the same brand, gender and category. Then you have a problem which is tractable; you don't need to do that number of comparisons anymore. So what else can we do when we are deploying this model?
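Before moving on, the arithmetic behind the 1,000 years and the clustering idea can be sketched in a few lines. The catalog sizes and the tenth of a second per pair come from the talk; the function names, the field names and the tiny example catalogs are assumptions for illustration.

```python
from collections import defaultdict


def brute_force_cost(n_a, n_b, seconds_per_pair):
    """Number of pairwise comparisons and the time they would take, in years."""
    pairs = n_a * n_b
    years = pairs * seconds_per_pair / (365.25 * 24 * 3600)
    return pairs, years


# Figures from the talk: 850,000 x 380,000 products, 0.1 s per comparison.
total_pairs, total_years = brute_force_cost(850_000, 380_000, 0.1)


def candidate_pairs(catalog_a, catalog_b, keys=("brand", "gender", "category")):
    """Yield only the pairs that share brand, gender and category."""
    blocks = defaultdict(list)
    for product in catalog_b:
        blocks[tuple(product[k] for k in keys)].append(product)
    for product in catalog_a:
        for other in blocks.get(tuple(product[k] for k in keys), []):
            yield product, other


# Hypothetical miniature catalogs.
catalog_a = [
    {"id": "a1", "brand": "s.Oliver", "gender": "women", "category": "skirts"},
    {"id": "a2", "brand": "s.Oliver", "gender": "women", "category": "dresses"},
]
catalog_b = [
    {"id": "b1", "brand": "s.Oliver", "gender": "women", "category": "skirts"},
    {"id": "b2", "brand": "s.Oliver", "gender": "men", "category": "jackets"},
    {"id": "b3", "brand": "Other", "gender": "women", "category": "skirts"},
]
pairs_to_score = [(a["id"], b["id"]) for a, b in candidate_pairs(catalog_a, catalog_b)]
```

With these figures the brute-force count is 323 billion pairs, which at a tenth of a second each is a little over a thousand years, while the clustered version only scores pairs inside each brand, gender and category group.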
Okay, we can apply some heuristics of the domain on top of the predictions of our model. Something useful that we found is that, in order to check whether a match predicted by the model is actually a match or a false positive, there are some rules that we can apply. The first one is about the colors: if within a match you have two products with very rare color names, and these color names are not matching, that is very likely not a match; it's going to be a false positive. Like in this case: you have a dress whose color is midnight blue, and this product is also available in sea blue and sky blue, and it is matched against another one which is blue, but that product is also available in midnight blue. That means that the real match would be midnight blue with midnight blue. So that is a pretty good indicator that you are dealing with a false positive.

Something similar happens with the material. If your system is able to extract the material perfectly, and you have 100% of the material composition for both products, and the percentages are not matching, that is very likely not a match. It's another way to find a false positive. And something similar happens with the price. Although one of the goals of the whole system is to compare prices, a match with very, very different prices is very rare, so it's very likely that those products are not a match, and you will need to do something about that.

Apart from this, there are some tips that we can give. For instance, as Alicia said, within this domain the codes are very rare and quite difficult to extract, and we would suggest not relying on the codes.
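The three rules above (colors, materials, prices) could be sketched as a filter applied to the matches predicted by the model. Everything concrete in this sketch is an assumption: the list of common color names, the product field names and the price ratio threshold of 2.0 are hypothetical values chosen for illustration, not figures from the talk.

```python
# Hypothetical list of color names considered common (anything else is "rare").
COMMON_COLORS = {"black", "white", "blue", "red", "green", "grey",
                 "brown", "pink", "yellow", "beige"}


def is_rare(color):
    return color is not None and color.lower() not in COMMON_COLORS


def suspicious_reasons(a, b, max_price_ratio=2.0):
    """Return the reasons why a predicted match looks like a false positive."""
    reasons = []

    # Rule 1: both color names are rare and they differ.
    color_a, color_b = a.get("color"), b.get("color")
    if is_rare(color_a) and is_rare(color_b) and color_a.lower() != color_b.lower():
        reasons.append("rare color names differ")

    # Rule 2: both compositions are fully known (sum to 100%) and differ.
    mat_a, mat_b = a.get("materials"), b.get("materials")
    if (mat_a and mat_b and sum(mat_a.values()) == 100
            and sum(mat_b.values()) == 100 and mat_a != mat_b):
        reasons.append("full material compositions differ")

    # Rule 3: prices are very different (assumed threshold on the ratio).
    price_a, price_b = a.get("price"), b.get("price")
    if price_a and price_b and max(price_a, price_b) / min(price_a, price_b) > max_price_ratio:
        reasons.append("prices are very different")

    return reasons


# Hypothetical predicted match that trips the color and material rules.
dress_a = {"color": "midnight blue", "materials": {"cotton": 100}, "price": 49.99}
dress_b = {"color": "sky blue", "materials": {"cotton": 80, "polyester": 20}, "price": 45.00}
flags = suspicious_reasons(dress_a, dress_b)
```

Returning the list of reasons instead of a plain boolean leaves the decision to the caller, who may discard the match or send it to manual review.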
So the codes for this domain are pretty useless. In our case, only 50% of the products have a code, a barcode. Another thing we have to think about, when we are defining the clusters to deal with the dimensionality of the problem, is the granularity of these clusters. For some categories it makes sense to have a very small, very granular cluster, like jeans, because jeans are pretty easy to detect within the trousers category. In other cases, like dresses, maybe it makes sense not to have such a granular cluster as party dresses; maybe it makes sense to have something like dresses, because party dresses are not so different from other dresses. So it's something you have to think about when you are deploying a system like this.

Okay, and just to finish this talk, some lessons that we have learned over the past few months. The first one is, as you may imagine, that having a standard taxonomy and a proper way to extract the characteristics of the products is key in this problem. Without a common taxonomy it's almost impossible to perform good product matching. The second one is that tools to visualize and understand your data and to debug your code are really, really important. This domain is, I mean, in general quite tricky, and it's very easy to have false positives and false negatives, so you need a way to control that and to visualize that. Also, the codes, the barcodes, especially the universal codes, are very scarce, but they are really, really useful. If you have them, trust them and use them to generate your training data. It's a cheap way to get training data up and running quite easily. The
It's a cheap way to get a train data up and running quite easily the Next one is that at least in our case at the end we're up using a lot of more Qlity as a student that we expected and as I said, it's a very tricky problem It's very easy to have mistakes And if you aim to increase the precision under recall at the same time You need a proper quality as our own pipeline to check all these Errors, okay, and the last one is kind of transversal across all machine learning problem And it's that the feature you use to compute the similarity and the data you use to train your model Is much more important than the model that you may be using in our case We are using gradient boosted tree which I mean they work pretty well But we also tried with neural nets and the differences are not really statistical significant The differences coming from the features and the data That's all thank you. That's you have any question Provide in some cases. I mean some companies approaching you to do a matching of their products with someone else It's another website that all other website might not be contracting you So how do you extract data from those other websites? We extract data as Google will do basically we I mean all the data which is public and be extracted And we have a bunch of spider basically doing that for us So we have a whole back-end team dealing with that basically extracting the data and moving that data into a database And this is kind of previous to this whole thing, but it's like like Google can do Okay, okay. Okay. Thank you. Thank you