Hey, I hope you can all hear me. This talk is about the product matching problem, something my team and I have been working on for about one and a half years now. It's a problem that's very easy to describe, very impactful, and incredibly hard to solve, and we have made good progress on it. The idea is to talk about the problem, why it's important, how we solved it, and the learnings along the way.

Let me start by talking a bit about Indix. It's best to understand what Indix does using this analogy: what Facebook is for people and Google Maps is for locations, Indix is for products. What do I mean by that? Basically, we are trying to organize, analyze, and visualize all the product information in the world and make it available so that everybody can act on it. This includes data collection, going from unstructured data to structured data, analytics on that data, visualization, and personalization and surfacing of the data. To give you a perspective on where we are today: we are on our way to building the world's largest product database. What do I mean by large? We have seven hundred million products, 40,000-plus brands, 7,000-plus categories, and about 10,000-plus attributes. We'll soon cross the billion mark, which will be very interesting.

Okay, so that's Indix. Let me jump into the product matching problem. As I said, it's very easy to describe and very hard to solve. Here is a shoe at five different retail sites. It's the same shoe; these are variants. The product matching problem is simply to say that these five products are actually the same product. That's the problem.

To make this a little more concrete, let me introduce some details. Assume I have a crawler and I have crawled a bunch of product pages, say about a billion — not too many, a billion. The idea is for the matching process to produce a set of groups of URLs such that any two URLs in a group are actually the same product. That block is the focus of this talk, and this is a fairly intuitive description of the problem.

Before we try to solve the problem, let's get a sense of why it's worth solving: what is the value, what is the business impact? The way we look at it, the product matching problem is central to almost all problems in retail analytics. Start with questions like: who are your competitors? What prices are they selling the products you carry at? What products should you carry that they carry right now? What products do you carry that they don't, which you could promote? Those are the kinds of questions you want to answer, and to answer them well, you need to be able to match products across stores. This diagram is a very abstract view of what we are trying to do: these are products, and I'm trying to learn or induce links across them. Scale matters, in the sense that for this information to be useful you need as many products as possible, as many sites as possible, and coverage within each category. Those are the dimensions of scale, and you have to get them right if you are to generate insights over the data.
This is just our app, showing pricing comparisons for a particular product. That's just one piece of the pie you can have when you solve this problem well. So that's the problem and its impact; now let's look at how we can solve it. There are multiple sub-problems and many moving parts, so I'm going to take each of these parts piecemeal, detail them, and toward the end of the talk they'll all fall in together.

Let's start with the first problem, which is parsing. What is the problem? I have an HTML page, and I need to extract key attributes — titles, image URLs, prices, descriptions, spec text — from the product page. I cannot hand-write parsers; they would be very hard to maintain at this scale. I could develop tooling around hand-written parsers, but that won't scale either. I need something that works in an automated way.

So let's look at it from a machine learning standpoint. How can I model this problem? Well, what is an HTML page? It's a DOM tree, and every node in it is either a title or not. If I can formulate this as a binary classification problem, do good feature engineering for each of these DOM nodes, and train a binary classifier that tells me whether a node is a title or not — and likewise train binary classifiers for descriptions, image URLs, and pretty much everything I need — I could attack this problem in a principled as well as scalable way.

Why did I think this might work? What's the intuition? The intuition I began with was: if somebody showed me a Chinese retail site — I don't speak Mandarin — I can still identify the title, the image, and the price. Why? Because this is about structure, not content. If we do the feature engineering well, this problem can be solved. That was the hypothesis. It turned out it does work, but there are a fair number of details to take into account, and this is the first learning: feature engineering is key. More than the algorithm, the feature engineering — at least for this particular problem — was key to getting it right.

Now let's look at what kinds of things are being extracted: that's the title, this is the variant information, that's the description, that's the image, these are additional images, and that's the price. There are two kinds, or two families, of features I can extract for each of these DOM nodes. The first is features I get from the HTML itself — the DOM tree. What are the ancestors of this particular node? What are their tags? How far are they from the root? How many children do they have? How many children do their siblings have? This was a massive, massive feature-engineering exercise, and it got us far.
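To make the structural-feature idea concrete, here is a minimal sketch of per-node feature extraction feeding a "title or not" classifier, assuming lxml and scikit-learn. The feature names and the choice of classifier are illustrative, not the actual Indix feature set:

```python
# Sketch: structural features for each DOM node, for a binary field classifier.
# Feature set and classifier choice are illustrative assumptions.
from lxml import html
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

def node_features(node):
    """Features derived from the DOM structure, not from the text content."""
    parent = node.getparent()
    return {
        "tag": node.tag,
        "depth": len(list(node.iterancestors())),  # distance from the root
        "num_children": len(node),
        "parent_tag": parent.tag if parent is not None else "<root>",
        "parent_children": len(parent) if parent is not None else 0,
        "text_len": len((node.text or "").strip()),
    }

def extract_nodes(page_source):
    tree = html.fromstring(page_source)
    # Skip comments/processing instructions, whose .tag is not a string.
    return [n for n in tree.iter() if isinstance(n.tag, str)]

# Training, given pages annotated with which node is the title:
# feats = [node_features(n) for n in nodes]; y = [n is title_node for n in nodes]
# X = DictVectorizer().fit_transform(feats)
# clf = RandomForestClassifier().fit(X, y)   # one such model per field
```

The same machinery, retrained per field, would give the description, image, and price classifiers described here.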
But that alone didn't get us all the way there, so what were we missing? We don't see a page as the DOM; we see it visually — it's an image. So we started rendering the page in a headless browser, in our case PhantomJS, and that allowed us to get a set of visual features. What do I mean by visual features? After rendering, I have the x and y coordinates of each DOM node. I can extract features about color contrast; you'll notice the price is often the highest-contrast element on the page. I don't know whether that's a design convention; it's just something we noticed in the data.

So with a combination of these HTML features and visual features, we were able to train classification models that solve this problem fairly well. There is a catch, though: the visual features require you to render the whole page, which is pretty expensive. So we don't do it everywhere; we do it for the complicated sites, and where the HTML features work, we just use those. That's how we attacked the problem, and I would say we solved it fairly well. There's still room for improvement, for more feature engineering, and for getting more data. An interesting thing you can do is build a classifier on one family of features, have it generate labels, and use that labeled data to train the other model. That matters because you need a lot of labeled data here, and there are limits to your crowdsourcing budget. So that's the parsing problem.

Okay, moving on. The second important problem you need to solve is product classification. It's a very simple problem, again: it's the "what is it?" question. What is that? It's a laptop. That's it. It's something human beings do very easily, but it's hard for algorithms and machines. What you're trying to do is look at all the extracted data you have from the parser and go to a category, typically a leaf node in a taxonomy. This taxonomy needs to be very specific for you to do a good job at matching.

Any guesses as to why these two products might be confused with each other? They're both pumps. I've learned a lot about women's shoes over the past one and a half years — more than my wife would like, which is interesting. But yes, one is a pump shoe and the other is an actual pump, and if you build a classifier on titles alone and "pump" has a very high token weight, you might get this wrong.

So this is a challenging problem, and the key learning here was to use ensembles. You cannot use just one class of data to solve this problem; you need a battery of approaches and a good ensemble. What are the different approaches? Let's start with the simplest, which is breadcrumb mapping. The classification data on retail websites can sometimes be wrong, but it is still good; people have different taxonomies, but if you can map them offline, you can use that information as one of the signals. You can train classifiers; we used linear support vector machines on titles and descriptions, mostly text data. You can use a CNN on the image. You can also introduce background knowledge: MSC Direct — a major industrial supply retailer in the US — would not carry women's shoes. So you can introduce features that capture the background knowledge that certain classes of products will not be available on certain sites.
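Since the talk names linear SVMs over titles and descriptions as one ensemble member, here is a minimal sketch with scikit-learn; the titles, leaf categories, and training examples are placeholders:

```python
# Sketch: one ensemble member -- a linear SVM over TF-IDF title features.
# The categories and examples below are hypothetical leaf nodes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

titles = [
    "Steve Madden patent leather pump women's heel",
    "Hydraulic water pump 2 HP single phase",
]
labels = ["apparel/shoes/pumps", "industrial/pumps"]

title_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
title_clf.fit(titles, labels)

# In the full ensemble this prediction is combined with breadcrumb mappings,
# an image CNN, and background-knowledge features before picking a category.
print(title_clf.predict(["Nine West classic black pump heels"]))
```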
Again, this was a massive feature-engineering exercise and a massive ensembling exercise, and finally we got to a point of reasonable accuracy — enough to make progress on this problem. And again, I'd like to stress the use of ensembles and not sticking to just one source of data. There are a few interesting things you can do here: you can crawl or procure the images offline and train models on them, and where you can afford it, image classification pretty much dominates the text classification. It's important to note the balance, though: everything associated with images, including procuring the data and training the models, is going to be expensive. But it's resources well spent.

Moving on, the next important problem is attribute extraction. Here you're trying to go from all the information the parser extracted, using a predefined schema, to attributes like brand, size, color, pack, and so on. Without getting this right, you're never going to get the product matching problem right. The schema tends to be very large, and it tends to be category-specific. If you have classification errors, they're going to bleed into these errors; it becomes a cascade, and that is very hard to recover from. So it's important to get classification right, and it's also important to get parsing right; it's a pipeline, but also an error cascade. That's what happens in attribute extraction: a large number of attributes, bad and missing data, a lot of challenges.

Just to give a sense of why: these are two training shoes, one from Nike, one from Reebok. If you don't get the brand right, you could still fall back on the images — that gets pretty expensive — so it helps to get the model that extracts brands right in order to solve the problem. We mostly used CRFs for this. Typically we're trying to extract, for example, "rubber sole": the sole being rubber, the brand being Nike. Colors can be pretty absurd sometimes — that's the color of the shoe, it's pretty weird, but whatever. The point is that you have to go ahead and define your schema and train models on each of the pieces of text you have, to extract the attributes that matter for matching.
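The talk says they mostly used CRFs for this step. Below is a minimal sketch of token-level attribute tagging, assuming the sklearn-crfsuite package; the feature template, BIO-style label set, and training example are illustrative, not Indix's actual schema:

```python
# Sketch: CRF tagging of title/spec tokens into attribute labels.
# Assumes sklearn-crfsuite; the label scheme here is made up for illustration.
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title_case": tok.istitle(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# One tiny training sequence: a tokenized title with per-token attribute tags.
tokens = ["Nike", "Air", "Max", "90", "rubber", "sole", "black"]
tags   = ["B-brand", "O", "O", "O", "B-sole_material", "O", "B-color"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))   # per-token attribute labels, one sequence per title
```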
Fine — now you're at a point where you can actually start attacking the matching problem. You have the original data that was parsed, you have the classified category, and you have your attributes. We refer to this as an enriched product record: a product record with all the information.

One way of attacking the problem is to say: I'll come up with a distance metric, do an all-pairs comparison over these products, and if the distance is small, call it a match. It's a pretty reasonable approach. The problem is that at our scale it would take roughly 18 years. N-squared doesn't work: even assuming each pairwise comparison takes a nanosecond, a billion products gives on the order of N²/2 ≈ 5×10¹⁷ pairs, which at a nanosecond each is well over 15 years of computation. I guess our funding would run out pretty soon; we're not going to crunch numbers for 18 years. So we had to do this faster. But it's not a big deal, in the sense that you have all this information you can use to solve the problem better.

What you want to do is block on a bunch of things: take your products and put them into groups such that the probability of two matching products landing in the same group is high. This step is essential. You can do blocking in many different ways. You can say: my classification and brand extraction are really good, I trust them, so I'll put all the products of a certain brand and a certain category into one partition. It's a simplistic approach; it depends on how much you trust your classification, and on what level of the taxonomy you want to block at. You might say my leaf-level accuracy isn't good enough, but one level above that I'm good — so let me block at that level. That's one way, and you can take different approaches for different top-level categories. The other two approaches are based on string-similarity joins and on hashing. The key here, again, is that no one approach works. You're not going to get good recall on your matches from one particular way of blocking. You have to try different approaches and then merge these groups to get the final set of groups on which you do pairwise inference. So the learning here is that one approach doesn't work; this is an ensemble too, though I won't call it that, since that term is more classification terminology.

Okay, so now you have these groups, and what you need to do is a pairwise distance computation — which again requires a fair amount of feature engineering — and then constrained clustering. Let me first talk about the pairwise distance computation. You basically construct a distance metric; you can learn this metric or define it based on background knowledge. It takes two products, and you construct representations from all the data you have. For example, you might take the title and convert it to a bag of words. Titles across retail websites are not exact matches, so you have to be fairly smart about it. You might remove stop words that don't really matter; for example, it's usually a good idea to remove "the". But it turns out "The North Face" is a brand — "North Face" is not the brand, "The North Face" is the brand — so you can't remove it in all cases. That's tricky. To cut a long story short, you come up with, or learn, a distance metric that gives you a distance between any two products in one of your groups.
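Here's a minimal sketch of a hand-defined distance over enriched product records; the fields, weights, and token-level Jaccard measure are assumptions for illustration, and a learned metric would instead fit the weights from labeled pairs:

```python
# Sketch: hand-defined pairwise distance over two enriched product records.
# Field names and weights are illustrative; weights could also be learned.
def title_distance(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0  # Jaccard

def product_distance(p, q):
    d = 0.6 * title_distance(p["title"], q["title"])
    # Disagreeing attributes push products apart; a missing attribute is
    # treated as "unknown" and contributes nothing rather than a mismatch.
    for attr, weight in [("brand", 0.3), ("color", 0.1)]:
        if p.get(attr) and q.get(attr) and p[attr].lower() != q[attr].lower():
            d += weight
    return d

p = {"title": "Nike Air Max 90 black", "brand": "Nike", "color": "black"}
q = {"title": "Air Max 90 sneaker black", "brand": "Nike", "color": "black"}
print(product_distance(p, q))   # small distance -> likely the same product
```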
Now, given this distance metric, you could go ahead and just cluster the products in each group, and whatever ends up in the same cluster is a match. You've done your blocking, you've used all your attributes, and you're good to go. The problem is that there's an extra layer of complexity: certain products just have to match, and certain products cannot match. When does that happen? Suppose I do have UPCs for some products, or MPNs — manufacturer part numbers — or a product line. I'll have certain background information about these products that I have to leverage; I can't ignore it. First, this information allows me to do better; second, it allows me to impose constraints. For example: one product at a store can only match one product at a different store — the one-product-per-store constraint. Now, it's possible that a duplicate product exists on the same store; marketplaces are nightmares in this regard. But with these constraints and a good distance measure, you can do constrained clustering.

It's fairly simple: you work bottom-up. You start with single-product clusters and keep merging them, making sure your constraints are not violated: if two things have to be together, you go ahead and merge them immediately; if they cannot be together, you prevent their clusters from merging. That's one way to solve it; you could improve the efficiency of this, but that's the fundamental idea. The key learning here is that plain distance computation with unconstrained clustering on top of it doesn't work — you'll get absurd results, and they'll keep popping up all over the place. What you really need is constrained clustering. That was an important learning.
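Here's a minimal sketch of the bottom-up constrained merging described above, assuming each product carries a store field and must-link pairs come from identifiers like UPCs; the greedy nearest-first strategy and the threshold are simplifications of what the talk describes:

```python
# Sketch: bottom-up constrained clustering within one blocked group.
# Must-link pairs (e.g. same UPC) are merged first; any merge that would put
# two products from the same store into one cluster is vetoed (cannot-link).
def constrained_cluster(products, dist, must_link, threshold=0.3):
    clusters = [{i} for i in range(len(products))]

    def cannot_merge(a, b):
        stores_a = {products[i]["store"] for i in a}
        stores_b = {products[i]["store"] for i in b}
        return bool(stores_a & stores_b)   # one product per store per cluster

    def merge(i, j):
        a = next(c for c in clusters if i in c)
        b = next(c for c in clusters if j in c)
        if a is not b and not cannot_merge(a, b):
            clusters.remove(b)
            a |= b

    for i, j in must_link:                 # hard positive constraints first
        merge(i, j)
    pairs = sorted(
        (dist(products[i], products[j]), i, j)
        for i in range(len(products)) for j in range(i + 1, len(products))
    )
    for d, i, j in pairs:                  # then greedy nearest-first merging
        if d <= threshold:
            merge(i, j)
    return clusters
```

Anything that ends up in the same cluster is then reported as a match.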
So now we have all our moving parts in one place, and we can step back a bit and look at how the whole thing works. We start with a bunch of HTML pages; we parse them to get product records; we classify the products; we extract the attributes; we block and construct product groups; and we do match inference to get the matches. That's the data flow, and each of these blocks contributes to the final process of matching.

Let me talk a bit about how we evaluate this. It's pretty much standard precision and recall, but there are a few nuances. In the case of precision, given a good spot-checking budget, this is easy: you have a bunch of reported matches, you decide on a sample based on the confidence interval you want, send it for spot-checking, and get a sense of how precise your matches are. You might want to do this at a customer level, a site level, a category level, or a leaf-category level; you want to segment the population of reported matches so you can really find out where you're doing badly. This is one of the big problems: if you just take one random sample from your entire set of reported matches and that number is high, it doesn't tell you how well you're doing for a paying customer. You have to sample that customer's products — and their competitors' — separately. The key idea is that if you care about a certain segment, you've got to sample from it; then, given enough of a spot-checking budget — a crowdsourcing budget, possibly outsourced — you can get a very good sense of precision. It's a good problem in that it's easy to evaluate; just throw money at it.

Recall, on the other hand, is extremely hard to estimate. Why? It's a needle-in-the-haystack problem. How many matches would I get if I did all pairs? There's no way to find out: the population of all pairs is enormous, as the earlier estimate showed, and matches are extremely rare within it, so sampling from it directly doesn't work. This is something we're really trying to work on: getting a good estimate of how many matches are actually out there. One approach we've used is to have people search for products on different sites where we expect them to be sold, put together a blind set, and compare our results against that blind set. It's a rudimentary approach, but it does let us get a sense of where we are recall-wise. The other thing you might do is construct an envelope of potential matches: say that anything within some Jaccard string similarity is a candidate, sample from that population, and treat it as the bounding envelope. So there are a lot of ideas here, but it remains a needle-in-the-haystack kind of problem, hard to estimate. In general, though, people are making pricing decisions on this data, so they care more about precision — and we have that part covered.
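A minimal sketch of the bounding-envelope idea for recall estimation; the similarity floor, sample size, and helper signatures are assumptions for illustration, not Indix's actual procedure:

```python
# Sketch: envelope-based recall estimate. Restrict attention to candidate
# pairs above a loose similarity floor, have humans label a random sample,
# and measure what fraction of the labeled true matches the system reported.
import random

def estimate_recall(candidate_pairs, similarity, is_true_match, reported,
                    sim_floor=0.4, sample_size=500):
    envelope = [pair for pair in candidate_pairs if similarity(pair) >= sim_floor]
    sample = random.sample(envelope, min(sample_size, len(envelope)))
    true_matches = [pair for pair in sample if is_true_match(pair)]  # spot-check
    if not true_matches:
        return None      # the rare-population problem: sample found no matches
    found = sum(1 for pair in true_matches if pair in reported)
    return found / len(true_matches)
```

The estimate is only as good as the envelope: true matches whose similarity falls below the floor are invisible to it, which is exactly the needle-in-the-haystack caveat.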
Okay, so toward the end, I'll just leave you with one key learning, which is not that deep: in order to do all of this, you need to build the right set of tools — for looking at your data, annotating data, looking at similar images. We started doing this with a little bit of scripting and a little bit of Excel, and that gets out of hand pretty quickly. You need an entire ecosystem of tools to do the machine learning right. So over a period of time we built a lot of tooling around downloading this data, identifying similar images and similar titles, clustering them, and so on, plus letting people annotate the data and generate training data. So I would say tools matter; build them early on, because the longer you try to make do with bad tools, the more it impacts your velocity. It's a very trivial kind of thing, but in the long run it matters a lot. So that's pretty much it. I'll leave you with what Indix does, and I'll take any questions you have.

Audience: Really nice talk, thank you. A couple of points. You mentioned the feature engineering was really hard, so I was wondering: is backpropagation — deep learning algorithms that detect the features themselves — an option for the first problem, the feature engineering? Also, one of your slides had something called a CRF; I didn't understand that.

Nikhil: To answer your first question — that's a very important question that I get asked by my stakeholders on a day-to-day basis. Feature engineering constitutes somewhere between 30 and 50 percent of all our activities. The hope is that deep learning fundamentally replaces this step. I see this happening in the case of images really fast and really well; in the case of text, I'm sure it'll get there, but not that rapidly. Then again, things might change, but we have had much better experiences doing deep learning with images than with text.

Audience: About the slide where you're matching products and comparing them based on their attributes: suppose there is a Nike shoe. Couldn't you get the catalog from the manufacturer's official product listing site, group the products according to that catalog, and find the closest catalog entry by distance?

Nikhil: Catalogs — crawling or getting data from manufacturers — first, they're not willing to share that data. They have current data; they don't have data they'll share over a long period of time. Second, the parsing becomes an even more notorious problem. It would be great if you had feeds for all of this and could use them, and I guess we'll venture into that, but in this particular form we were trying to solve the problem purely starting from HTML pages.

Audience: But those come under different categorizations of data, right? One is the metadata — the product listing or the catalog — which you can treat as a source of truth. All the other things can be treated as transactional data, the data required for processing. Other things can come under the metadata category as well, like a list of recognized brand names — like you said, "The North Face" — so you'd know these words are definitely recognized brand names. If you have certain amounts of metadata, could you go with that?

Nikhil: Again, if you are Google, people will share this data with you willingly; if you are a small startup, they won't — that's one reason. Second, our entire line of data products around this product database is about competitive price intelligence. Why would a retailer give me information? Retailers would never share it, unless there are other reasons — they're customers, or whatever. And manufacturers again have no incentive to share this information. They'll have it on their websites, sure — because that drives sales for them. I agree that in a sense it's an artificial problem: if you had good, clean, structured data from everyone, none of this would be required. But from a purely business standpoint, this data is very hard to procure, and that is the whole reason we do it the hard way. It's a business reason rather than a technical one. So yes, if you had access to every retailer's and manufacturer's data, this wouldn't be a problem.

Audience: Thank you. You index these separately, and in that case the UPC might be the same across products. Have you faced challenges there, and how did you address them?

Nikhil: That's a very good question — I'm sure you've actually worked with this data, and that's why you asked.
Packs are notoriously hard: packs, boxes, single items. We have models that individually recognize these and put them into attributes. To give everyone some context: a retailer will often take five pens, treat them as a single unit of sale, and sell that. The price is going to be roughly five times the price of a single pen, but on the page you're just going to have "pack of five" somewhere. So we do attribute extraction for that; that's how we solve the problem, and getting that attribute-extraction step right is critical.

Audience: How do you work around the UPC problem itself?

Nikhil: UPC data is sanitized based on the titles and images associated with it. No single piece of data is gospel, so you do have to sanitize it; but if you have a high level of confidence, you can probably use it. So yes, there have been enough and more cases where UPC data was bad, but the way we take care of that is by looking at the other pieces of information as well and sanitizing it separately.

Audience: My next question is about category finding, where you're doing breadcrumb mapping. How are you building that taxonomy?

Nikhil: I can point you to a master's thesis out of IIT Madras by someone who worked on this problem very recently; I guided him. To give you a short but good answer: you start with taxonomies from multiple different places and train classifiers for each of them. Then you take a product from one store, see what classification it got there, and predict its category under the other store's taxonomy — and vice versa. That allows you to construct a matrix with a category from one taxonomy on one axis and a category from the other on the other axis, and based on that kind of analysis you can build a richer taxonomy.

Let me give you a very specific example; this is a very interesting problem. There's a site called Sweetwater that does only musical instruments, and its taxonomy for musical instruments is way deeper than what Amazon has. So when I'm constructing my master taxonomy, I want their entire tree to go sit right in the place where Amazon's corresponding category is. This kind of automated approach — training classifiers on each taxonomy, having them make predictions on the other's products, and then looking at how the breadcrumbs align — is the approach you might use to construct this really large taxonomy. Again, it's not perfect; you still have to sample, spot-check, and validate, but that's how we attack that problem.
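A minimal sketch of that cross-prediction idea for taxonomy alignment; the classifier interface matches the earlier title-classifier sketch, and the category names are hypothetical:

```python
# Sketch: align two taxonomies by cross-prediction. Classify site A's
# products under site B's taxonomy and count which category pairs co-occur;
# heavy cells show where A's subtree should attach in the master taxonomy.
from collections import Counter

def alignment_counts(products_a, clf_b):
    """products_a: (title, category_in_A) pairs; clf_b predicts B's categories."""
    counts = Counter()
    for title, cat_a in products_a:
        cat_b = clf_b.predict([title])[0]
        counts[cat_a, cat_b] += 1
    return counts

# e.g. a heavy ("Guitars/Electric/7-String", "Musical Instruments/Guitars")
# cell suggests the deeper 7-string subtree belongs under the coarser node.
```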
Audience: I work at Lazada. We're an online retailer, and part of our business model is one of those marketplaces that you hate — we don't do that to you on purpose. The data we get from our sellers is often dirty and messy, so we're going through the same exercises you are: trying to entity-match all the data, recognize everything, and get it nice, neat, and orderly. Great presentation — fantastic, thank you; it's the sort of stuff that keeps me awake at night. In this process, how often do you turn to manual people to QC? At some point you'll reach a certain level of confidence where you can push things into "yes, this is a match" or "no, it isn't", and then there's some middle ground where you spend time with humans doing the yes/nos. Roughly what percentage of the products falls into that? And can you use that training data to actually improve results over time?

Nikhil: To answer your question: we have segmented all the inferences we draw from our data based on how important they are. If a paying customer or a proof-of-concept customer is looking at a certain piece of data, that's the highest tier; prospects are the next tier; then there are things we care about from the point of view of scale, for many different reasons. So basically we have broken the output of the process down into these tiers, and each tier has a confidence interval we want to operate at. Based on that interval — which essentially represents how much we care — there is a frequency at which we sample and a sample size that we take. The resources we have are distributed across these tranches, and we are sampling and spot-checking all the time, because, especially when it comes to prices, we are very paranoid. Are you asking in terms of budgets?

Audience: In terms of the number of items spot-checked.

Nikhil: Typically, for a 95% confidence interval with a 50-50 underlying distribution and a margin of error of 5%, a sample size of around 700 is good enough — that's the number I remember off the top of my head. But those are the three numbers to think about: what is the underlying distribution (a rare population is a problem), what is my margin of error, and what is my confidence interval. We start with the margin of error and the confidence interval, we have a sense of the underlying distribution, and we let those numbers guide the sampling. There are cases where we want to operate at a 99% confidence interval with 2% error; I think the number that was asked for was 97% accuracy — I don't know why it was seven rather than six or whatever, but there was a number we needed. That's an SLA, right? We need to meet it. Based on that, we do the sampling and the spot-checking.
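For reference, these sample sizes come from the standard proportion-estimate formula n = z²·p(1−p)/e². The quick check below is our addition, not from the talk; note the textbook value for 95% confidence and a 5% margin is about 385, while the roughly 700 the speaker recalls is close to the 99% figure, so it errs on the safe side:

```python
# Sketch: spot-check sample size from the normal-approximation formula
# n = z^2 * p * (1 - p) / e^2, for a large population of reported matches.
import math

def sample_size(z, p=0.5, margin=0.05):
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

print(sample_size(z=1.960))               # 95% CI, 5% margin -> 385
print(sample_size(z=2.576))               # 99% CI, 5% margin -> 664
print(sample_size(z=2.576, margin=0.02))  # 99% CI, 2% margin -> 4148
```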
Audience: Hi. Mine is a follow-up on what he asked, more on the quality side of the whole process, since you have a series of tasks where something has a chance of going wrong somewhere. I understand you have manual QC as part of every phase; have you also automated the whole quality process? How do you ensure everything is right before it goes ahead into further business use?

Nikhil: One part of his question that I didn't answer might answer yours. Every single model we have is calibrated, in the sense that it produces a probability. This is the single most important thing you have to do to get this right. What the probability gives you is the ability to look at predictions made at low confidence, send them for spot-checking, remove them from your output, have them annotated, and use them as training data. So for every stage, on a continuous basis, we track how many predictions were made at low confidence, and those go directly into the crowdsourcing pool. So yes, this process is completely automated. All we have is: there's an API, we submit, the number comes back, and there's a dashboard — if that number is red, I know I'm not getting sleep tonight. That kind of thing.

Host: Please take all your other questions offline. Thank you, Nikhil.