Good afternoon, everyone. Let me quickly introduce myself. I'm Vinod Kumar, currently the CTO and managing director of BloomReach India. To give some quick context on this topic: for the last three and a half years, we at BloomReach India have been building an e-commerce search engine called SNAP, which is currently live with a lot of customers in the US, the likes of Sears, Staples, Walmart, etc. Prior to BloomReach I worked at Google, in particular on Google News ranking; I wrote most of the ranking code for Google News back then. So in this talk, what I want to do is draw on my experience working in the web world, the news world and then the e-commerce world over the last three and a half years, provide insights into the challenges, and briefly discuss approaches to solving some of the problems in the e-commerce search space.

So, search engine. The moment we say search engine, the thing that comes to everybody's mind is Google. And talking of Google, can anyone tell me what this number is? Sorry? Yeah, so basically this is the total number of pages indexed in Google right now, or at least as of a year back. So when you give a query, Google is searching 60 trillion documents and giving you the result.

Suppose I search for NASA. What's the result that you expect? I see some people taking their mobiles out. Don't take your mobile out; you don't want to ask Google what the first result for this query will be. So what should be the first result for NASA? Sorry? The wiki page of NASA, or basically the NASA website, right? So does anyone think this is a bad result? Okay. So let's say the query is Turing. What do you think should be the result?
Yeah, Alan Turing's Wikipedia page. So everybody agrees with that, or does somebody not agree? Yeah, maybe the Turing test, but this is close enough. So, tallest building in the world: Burj Khalifa. This should be the result, right? Does anybody disagree? No.

So let's look at the problem that Google is trying to solve. In most of these cases there are one or two right answers, and the challenge is to retrieve that one right answer for the query. The problem is like searching for a needle in a haystack: you have 60 trillion documents, out of which you want to find that one relevant answer for your query.

Now let's switch to e-commerce. A typical corpus for most of our e-commerce customers is 1 million products. Just to contrast: if web search is about searching a region whose area is equal to the size of Africa, then e-commerce search is basically searching an area the size of Cubbon Park. So you can see the contrast. If you can search the entire web, is it a big deal to actually search Cubbon Park?

So the question basically is: how hard is building an e-commerce search engine? Is there really a problem of scale here? And is an e-commerce search engine just a microscopic web search engine? The pun is intended: there's a million documents here and trillions of documents there, so is it really just a micro web search engine? Is that the case?

One of BloomReach's big customers is Neiman Marcus, which is a high-end fashion retail outlet in the US. Most of the customers on Neiman Marcus are women, and really rich women. We built a search engine for them and we are live on their site. So let's understand this problem.
Suppose a user queries shoes on this website. Neiman Marcus has 20 different categories of shoes from 100 different designers, and almost all of these designers are world-class designers; you name them, they are on Neiman Marcus. And they have 10,000 shoes in their inventory. So basically all of them are relevant. Unlike the web case, all of them are relevant and there is no right result. If I show you a set of 10 results and ask you whether this is the right result, I'm not sure you will all agree and say yes. You may have a different opinion of what the right result should be. So the problem is quite different here. The problem is one of choosing the first among equals, or the best among equals.

So how critical is this problem to solve? To answer that question, let's look at this graph. This is a graph that I constructed using data from most of our customers; basically all of Amazon's competitors in the US are pretty much our customers. And what does this graph show? Suppose 100 people land on your site. Only close to 50 of them actually go on to click and go to another page. By the time people have looked at three pages, only 25 to 30 percent of your visitors have actually gone to the third page. This is basically an exponentially decaying curve, so you're losing customers on every single page.

How many of you over here have worked on marketing or growth hacking? Anybody? Yeah, so you know how hard user acquisition is. And if you're going to lose 50 percent on every single web page, that's not good.
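As a rough illustration of the drop-off curve described above, the sketch below assumes that a constant 50% of visitors continue from each page to the next. That constant rate and the function name `visitors_reaching_page` are assumptions made for illustration; the talk only gives approximate figures (about 50 of 100 visitors clicking through, and 25 to 30 percent reaching the third page).

```python
def visitors_reaching_page(landed: float, retention: float, page: int) -> float:
    """Estimate how many visitors reach a given page depth.

    `page` is 1 for the landing page. `retention` is the assumed fraction
    of visitors who continue from one page to the next (hypothetical
    constant; real curves are only approximately exponential).
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    return landed * retention ** (page - 1)


if __name__ == "__main__":
    for page in range(1, 5):
        print(page, visitors_reaching_page(100, 0.5, page))
```

Under this assumption, a quarter of the original visitors are left by the third page, which is in the same range as the 25 to 30 percent quoted in the talk.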
Most of the pages on an e-commerce website I call search pages. One kind is the pages that are rendered explicitly after a query; category pages are also search pages, generated using an implicit query. So pretty much an entire commerce website is about search pages. If people are going to jump off the website at this rate, that's a lot of pressure on the ranking. You need to get it right.

So let's try to design an algorithm for building an e-commerce search engine. What's going to be the PageRank of e-commerce? We said there are 10,000 shoes. Now, how do we rank them? A very simple algorithm is to rank all the relevant documents by their sellability. So we need to define sellability. Can somebody come up with what could be a good sellability metric? Sorry? How many were sold? That's great. Anything else? Mm-hmm, right.

So let's start with the sold case. Let's define sellability to be revenue. What are the issues with defining sellability as revenue? Exactly, new products. Yeah, you can handle out-of-stocks, but... sorry? Yeah, right. If you define sellability to be revenue, there's little scope for newer products to make it to the top. The old, all-time revenue garnerers will continue to dominate. So what does this mean? It means that revenue cannot be all-time. You need to weight your revenue by recency. Revenue that you got from products in the last week is obviously more important than revenue that you got from products a month ago. Things are changing a lot in the e-commerce world. There are fashion trends, and even in electronics there are new products released on a daily basis. So you can't look at all-time revenue at all. And the second issue is presentation bias.
Forget new products: products at the top would continue to dominate the ranking. Can products in the fourth row, or on the second page, manage to come to the first page? Pretty much no, because the products at the top will keep gathering some amount of revenue, and therefore your ranking will be static. What you'll find is that your page is slowly losing revenue, and the overall revenue for the site is going to go down.

So how do we solve the problem? One thing that's clear is that revenue alone is not enough; impressions are equally important. You not only need to look at revenue, you need to start tracking product impressions. The reason is that you need to know how many times a product was shown and how many times it was bought. Otherwise you cannot handle presentation bias.

So let's try to redefine sellability. We said we will not take all-time revenue; we will take revenue weighted by recency. So let's say our time-weighted revenue is revenue weighted by recency. Let's come up with a formula for that; I'll leave it to you to do that. And then we say sellability is basically a function of time-weighted revenue and the number of impressions that a product got.

One thing I want to mention over here is that tracking impressions is a non-trivial problem. You want to know whenever a particular product came into the view of a user. It could be somewhere at the bottom of the screen or at the top of the screen; whenever it was shown, you want to track that impression. And what is equally useful is to track and make note of the position in which those products were shown as well. So who can tell me what would be a simple function for sellability? How would you define F?
Anybody? Yeah, you could basically take the time-weighted revenue divided by the impressions. That gives a good handle on how a product is performing. And with this approach, even for newer products, the moment they start getting impressions, and if these impressions convert, you can quickly move them to the top. But you need to do one more thing, which is that the algorithm not only needs to define sellability, it also needs to create scope for products to get impressions, which is different from the ranking itself. How are you going to introduce these products into the ranking stream, onto the pages, so that you start getting impressions for these products?

So is that enough? Let's say that we've handled all of these. Is that going to be enough? Right now our sellability is a nice machine learning algorithm which is based on the past performance of products. Now, what if we offered a discount on the product? The question is: does sellability remain the same? Anybody? Sorry? Yeah, sellability possibly increases.

Just take a quick example. If the product is an iPhone 6 at a 30% discount, how much do you think the sellability will increase? What's the multiple? Sorry? 10x, okay. 1000x, okay. So there are a lot of Apple fanboys probably waiting for that 30% discount to buy the iPhone. Let's say I'm a little more conservative, coming from Google, so I say 2x. One could expect maybe a 2x increase in sellability for the iPhone 6 at the discount. So how did we determine 2x as opposed to 10x or 1000x? We don't have any past data, so there is a need to come up with a scheme to determine this number. And is this number going to hold true for all products?
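The sellability definition discussed above (time-weighted revenue divided by impressions) could be sketched as follows. The talk leaves the recency-weighting formula to the audience, so the exponential decay with a 7-day half-life used here is purely an assumption, as are the names `Sale`, `time_weighted_revenue` and `sellability`.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sale:
    revenue: float   # revenue from one purchase
    age_days: float  # how long ago the purchase happened


def time_weighted_revenue(sales: Sequence[Sale], half_life_days: float = 7.0) -> float:
    """Sum revenue, halving the weight of a sale every `half_life_days`.

    Exponential decay is an assumed weighting scheme, not one given in the talk.
    """
    return sum(s.revenue * 0.5 ** (s.age_days / half_life_days) for s in sales)


def sellability(sales: Sequence[Sale], impressions: int,
                half_life_days: float = 7.0) -> float:
    """Time-weighted revenue per impression; 0.0 if the product was never shown."""
    if impressions <= 0:
        return 0.0
    return time_weighted_revenue(sales, half_life_days) / impressions


if __name__ == "__main__":
    # An old bestseller shown many times versus a new product shown rarely.
    old = sellability([Sale(100.0, 28.0)] * 10, impressions=10_000)
    new = sellability([Sale(100.0, 0.0)] * 2, impressions=200)
    print(old, new)
```

Dividing by impressions is what lets a new product with few but well-converting impressions outrank an old product that has simply been shown more often. It does not by itself address how new products get their first impressions, which the talk treats as a separate problem.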
Is it the case that if I give a 30% discount on another mobile phone, sellability is going to increase by the same factor? Can one expect the same kind of increase in sellability? No.

So actually this brings us to an interesting question: what is actually a document in the e-commerce world? In the web world, we know that every web page is a document. So we were thinking that a product is basically a document in the e-commerce world. But as we can see, attributes like sellability are not a function of the product alone. A product at every price point actually has a totally different sellability. So should we model a document as a tuple, as something which comprises both product and price? Let's say we do that. What about SKUs? A dress of a particular size may perform better than the same dress at a different size. An L-size dress might be performing better than a dress of another size, for exactly the same model. So should my document be a combination of product, SKU ID and price?

I just wanted to bring this to your attention to think about. When you're designing an e-commerce search engine, you need to look at your application, and right from defining what your document is, there are going to be repercussions on how you model your ranking algorithm as you go forward.

Let's take a query now. Based on whatever we've discussed so far, let's say we've implemented a ranking algorithm. It handles past data, it handles sellability, it handles price variations and all of that. Now, we've seen this query at Kmart, which is one of our customers: a query called frozen. And when we were actually launching our search on their site, the results shown were frozen fish, frozen vegetables and ice cream.
So I call this a query frozen in time, but really it's the results that were frozen in time, because this happened in the meantime: the Frozen movie got released. When users type in frozen, they are no longer looking for frozen food. So what are they looking for? They are looking for Frozen merchandise. They're looking for Frozen dresses, toys and other accessories with Frozen on them. So basically the meaning of the query has changed over time.

Okay, so the e-commerce search engine got it wrong. Let's look at what Google did. If you had gone and queried frozen on Google, it wouldn't show you frozen foods. Even before the movie was launched, it would be showing results related to the Disney Frozen movie. So how did Google get it right? Because Google is learning from the web. Just before the movie was launched, Disney would create a number of pages around Frozen. There would be a lot of review articles on Frozen, or articles with expectations about Frozen being released. There's some buzz in the news, then there are social media mentions, and there's also user query interest around Frozen. Looking at all of that, Google can learn that frozen no longer means frozen food products, or the other meanings of frozen, but the movie Frozen.

Now why did our e-commerce search, or a typical e-commerce search engine, get it wrong? Because the park is too small. Which park is too small? Cubbon Park. We said that e-commerce search is looking at a corpus which is as small as Cubbon Park. You pretty much cannot learn from just a corpus of 1 million documents. Query understanding does require mining the web and other social data streams. So in other words:
In order to build an e-commerce search engine, you can't just look at the 1 million documents and build everything on top of them. It requires you to do web-scale data mining. That means that you still have to study Africa in order to search Cubbon Park, going back to our first example.

I also wanted to point out one more thing, about how real-time e-commerce search needs to be. Here's an example. Let's say the query was Virat Kohli jersey. How many of you think that this is a great result? Sorry? Right, yeah. But otherwise it's a good result, right? Exactly, as you mentioned: what if the IPL season was about to kick off? Then this is the jersey that you probably want to show, not this one. And if you were looking only at your corpus and modeling everything based on previous performance, you would probably continue to show the first jersey, which is no longer what people are looking for.

So how are you going to change your results as things around you change? There are news events, and then there is a seasonality factor. With the IPL, it's a mix of both news and seasonality. But how is an e-commerce search engine, which is just looking at one merchant's data, going to adapt to changes happening in the world, like the Frozen movie being released or the cricketing season changing? It means that you need to look at query streams, and you need to understand what those query streams mean. Which means that you need to start looking at the web, news and social media in order to understand what you need to serve. As I said, the corpus is not going to tell you any of this. You have to learn from other sources.

So how many of you know what this shoe is? Okay, sorry?
Okay, yeah, it's a pump, and it's called the white swan shoe on Neiman Marcus. It's a very popular shoe, actually. If you go to Pinterest, you'll find a lot of people have pinned this shoe. The designer is Manolo Blahnik, and this is how he looks. And the interesting thing is that it costs 88,329. If you want to buy it, you can try it on Neiman Marcus right now.

Okay, so here's a question for you all. This product garners a lot of page views on Neiman Marcus. As I said, people are pinning it on Pinterest and sharing stuff about it, so obviously it's getting a lot of attention. But it's making no money. There are no conversions. So the question to you all is: should we show this product, say for the query shoes? How many of you think that we should show this product? Okay. How many of you think that we should not show this product? Okay, so basically there's a 50-50 split that I see.

So how do we resolve this confusion? Let's do an experiment, which is what we did at BloomReach. This is the set of results that we were showing for the query shoes on Neiman Marcus. What we did was conduct an A/B test, and we replaced one of the results with the white swan shoe. Now who can tell me which side performed better? Right side, okay. Yes, the right side actually performed better than the left side. In fact it was generating 6 to 7% more revenue than the left-hand side.

So can someone tell me why the right side performed better? Sorry, one at a time. Okay, so he says that if you put it on the right-hand side, a shoe of 10,000 rupees will look cheaper. Yeah, probably. Any other answers?
Great, yeah, so he says that it's a store mannequin. We have to make our guesses based on what's happening, and these are very reasonable arguments. The main point that I want to make is that in the web world, only the right result matters. It's like searching for the needle in the haystack: you get that one single right result and people are happy. Whereas in the e-commerce world, not only does the right result matter, what you show around it also matters. The collection matters. And remember, shopping is all about comparative decision making.

So to take a cue from what you said, let me call these products mannequin products. Why do I call them mannequin products? Because these products are great to look at, and they take prime real estate even in the offline world. Right at the entrance of a store, you will have these mannequins wearing the best dresses and accessories available in the store, probably the highest-priced version of all the products. And basically this induces buying of similar products. It makes you enter the shop in the offline world. Similarly, these products induce buying of products from the same designer, shop or style, sometimes at a lower price point.

So let's try to design an algorithm to identify mannequin products. This is a simple algorithm. Take every product bought, and look at all products that have been viewed along with the bought product in the same session. During browsing, people will look at N products and then decide to buy a product. So take a product and aggregate all products that have ever been viewed along with that product.
So this is what we do. There's a red shirt, and while buying the red shirt, the user viewed a red polka dot shirt, then some shorts, and then a yellow shirt. And then there was another user who bought a blue shirt and also looked at a gray shirt, the red polka dot shirt and a black t-shirt. So now which is the mannequin product? Exactly, the red polka dot shirt is the mannequin product.

So here's a tricky question: how do you implement this algorithm? What would you do to implement it? Sorry? What's the infrastructure you'll use to implement this? This is a very simple example; we took two products. If you have to compute mannequin products across a merchant, you obviously have to look at all sessions and all products bought, and then aggregate across all of these products to get the top-performing mannequin products. So what is the distributed computing infrastructure you'd use? Sorry? Basically you'll write MapReduce code. In the mapper you can get the products viewed along with each bought product. You invert it, and in the reducer you can take the most frequently occurring products, which will turn out to be the mannequin products.

Okay, so let's say that we've gone over a bunch of issues and covered how we address each of them. So let's look at BloomReach ranking. Hopefully we've learned from these problems, we've taken the insights, and we've implemented these algorithms. So I thought, let's check whether our algorithm works. This is basically BloomReach ranking. The query is Prada shoes, and these are the results that are shown. So what can you infer from this result? Sorry? There's no right or wrong answer, so please say it. Sorry?
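The mapper and reducer described above for mannequin products can be sketched in plain Python, without any distributed framework. The `Session` type and the function names are hypothetical, and the sample data is the two-shopper example from the talk. Note that this sketch only counts how often a product was viewed in sessions where something else was bought; it does not check that the product itself rarely converts.

```python
from collections import Counter
from typing import Iterable, Iterator, List, NamedTuple, Sequence, Tuple


class Session(NamedTuple):
    bought: str             # product purchased in this session
    viewed: Sequence[str]   # other products viewed in the same session


def map_session(session: Session) -> Iterator[Tuple[str, int]]:
    """Mapper: invert the session, emitting (viewed_product, 1) pairs."""
    for product in set(session.viewed):
        if product != session.bought:
            yield product, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reducer: count in how many buying sessions each product was viewed."""
    counts: Counter = Counter()
    for product, n in pairs:
        counts[product] += n
    return counts


def mannequin_products(sessions: Iterable[Session], top_n: int = 1) -> List[str]:
    """Return the products most often viewed alongside purchases of other products."""
    pairs = (pair for s in sessions for pair in map_session(s))
    return [product for product, _ in reduce_counts(pairs).most_common(top_n)]


if __name__ == "__main__":
    sessions = [
        Session("red shirt", ["red polka dot shirt", "shorts", "yellow shirt"]),
        Session("blue shirt", ["gray shirt", "red polka dot shirt", "black t-shirt"]),
    ]
    print(mannequin_products(sessions))
```

In a real MapReduce job the shuffle phase would group the mapper output by product key before the reducer runs; here a single `Counter` stands in for both steps.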
Yeah, a couple of them are pre-orderable, but that's fine; they'll get it in time, and people don't mind buying pre-ordered products on Neiman Marcus. Variety, okay? One thing is that actually all of them are high-heel pumps, and in a certain price range: $750 or even higher, $1350, or $650. These are all higher-priced products, and most of them are pumps. In some sense there isn't much variety; they're all pumps.

Now let's take the result for the query shoes. Can anyone see any pattern in this? There's more variety, and the prices are reduced, actually.

So I wanted to analyze this. We said that we will rank our products by sellability. Every product is supposed to have a sellability score, and once you retrieve the relevant products, the ranking pretty much has to be in the order of sellability. So I thought, let me look at the Prada shoes in this result set. Here's a Prada shoe on the left-hand side, and here's a Prada shoe over here. So the fourth result and the fifth result are Prada shoes. Are these the Prada shoes that we saw on the previous page? No. This is a loafer and this is a sandal, and the shoes that we saw earlier were all platform shoes, high-heel shoes.

All through this talk I've been talking about sellability. So if we have implemented the ranking the right way, shouldn't we be seeing the same Prada shoes? Is there a bug in our ranking? That's the question. Now, as you pointed out, one thing to note over here is that the products over here are all on discount; everything has a reduced price. And for the Prada shoes, the partial order has not been maintained. These are not the products that we saw on the other page. So why is this happening?
Is it due to a bug or something else? Just to highlight, the prices of these Prada shoes are $400 and $200, and over there we saw everything was $750 or $1300.

To understand what's happening, let's analyze the average price at which people buy products when they issue a certain query. We call that the average order value. So what's an AOV, an average order value? It's the price at which people buy products after issuing a particular query. It turns out that Prada shoes has an AOV which is close to $700. So when somebody queries Prada shoes and goes on to purchase a Prada shoe, they are willing to spend on average $700. But when the query is shoes, people actually spend only $300.

As I said before, most Neiman Marcus users are brand conscious. They're all very rich and they're not at all price sensitive. So if someone types a query without a brand on Neiman Marcus, it means that they are first of all very price sensitive, and their only intent in buying on Neiman Marcus is to show off to people that they have bought a product on Neiman Marcus. Most of the queries on this site are prefixed by a brand name.

So what did we learn from this? Ranking cannot be independent of the context. You cannot just rank products by sellability. What you need to have in place is context-aware ranking. So with this I conclude my talk. And what are the takeaways from this?
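The per-query average order value described above can be computed from a purchase log along these lines. The log format (pairs of query and purchase price) and the function name are assumptions, and the sample prices are made up so that they reproduce the approximate AOVs quoted in the talk ($700 for Prada shoes, $300 for shoes).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_order_value(purchases: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average purchase price per query.

    `purchases` is a hypothetical log of (query, price) pairs, one per
    purchase that followed the given query.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for query, price in purchases:
        totals[query] += price
        counts[query] += 1
    return {query: totals[query] / counts[query] for query in totals}


if __name__ == "__main__":
    # Illustrative, made-up purchase log.
    log = [
        ("prada shoes", 750.0),
        ("prada shoes", 650.0),
        ("shoes", 400.0),
        ("shoes", 200.0),
    ]
    print(average_order_value(log))
```

How the AOV then feeds into a context-aware ranking is not spelled out in the talk, so this sketch stops at computing the signal.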
Thank you. So the takeaways are: the corpus may be small, as in the case of e-commerce, but the ranking is complex and hard. And let me say this, I know maybe there are a few Googlers here: even Google has struggled with product search historically. For anybody who has followed what Google has done on product search, they started with a product called Google Base, then it was called Google Product Search, then Google Shopping. It's a hard problem, even for a company like Google.

You need to learn from web-scale data, not just the corpus. Then there are a lot of signals to incorporate: sellability, seasonality and freshness. Then it's important that you track all user behavior, including impressions, which are as critical in ranking as views and clicks are. You need to have context-driven ranking. And there are unique challenges: a collection is more important than one right result, as we saw with mannequin products.

One thing I just want to highlight: anybody who comes and talks about e-commerce search always focuses on personalization. I haven't even talked about personalization. Even without personalization there is a whole lot that you need to do to first get the ranking right, and then you add a personalization layer, depending on how much you can learn about the user. That's its own vertical, actually.

So with this I hope that I've given you some insight into what happens in building an e-commerce search engine, and some of our experience in building the product over the last three and a half years. Thanks to Fifth Elephant for the opportunity, and if you have any questions I'm happy to answer.

Just a quick announcement: the owner of the car KA-03 MKA-242 needs to move his car ASAP, and also KA-03 MT-7008.

Any questions? Questions?
One quick question with respect to BloomReach search. As you said, personalization is what the entire world is trying, and you stopped short of that, which is a great thing. Other than that, does the search include the comments for each specific product? For sellability we had multiple tuples, as you said, price etc., compared to a document on the web; you gave a good analogy. Does it include the comments also? For example, the shoes: people type comments for the product. So does the sellability vary with respect to the comments in BloomReach?

You mean reviews?

Reviews, yeah. Not on Facebook or other social media; on the website itself. Do you include them or have you avoided them?

So reviews are generally a signal that can tend to be spammed, both by the merchant and sometimes by the user as well. One angry user can go and try to spam, or the competitors can spam it. But we have far more accurate signals. The proof of the pudding is in the buying. We know whether the product is bought or not. We have a pixel: when a merchant ties up with BloomReach, we give them a pixel which is integrated into every single web page of the entire website. So we know about every single transaction happening, and I think that's the biggest signal. You can use reviews when you don't have other signals to substitute for them, but nothing is better than final buying behavior, which we are able to track.

True, but have you included them and made a trade-off, or did you not include them at all? For example, page views are kind of analogous.

We have not included reviews, because we know that they can be spammed. The merchant himself adds a few reviews here and there, for many reasons, SEO reasons, product selling reasons and all of that. So we have kept reviews away.

Right, yeah. Thank you.

Any other questions? We have time for only one more question.

Hello. Yeah, so I actually
have two questions. Number one: how do you resolve ambiguity in queries like, let's say, silver jeans? Silver could be a brand of jeans and silver could be a color. In one case the Silver brand might be the better search result, but in some other case silver-colored jeans would be the better result. So how do you go about resolving this at run time? Number two: you mentioned that for the brand you are working with, average order value was a key metric that helped you decide that for shoes the pricing should be different versus a specific Prada shoes query. Was it a business insight, or did you go about figuring out that for shoes maybe price sensitivity works, or for a t-shirt maybe something else works?

Okay, so let me first answer the second question. The way you come to that conclusion is simple: for every query, look at the products bought and the basket value. There is a class of queries where the purchase value is really high. Let's say somebody is going to make a certain query and then buy something for a thousand dollars. You need to get your ranking absolutely right. Of course, the volume also matters. In the Google ads world, there used to be certain keywords, like legal keywords, and the bidding for those keywords used to be really, really high, because if that client comes in, the lawyer is going to make a million dollars. So he'll bid fifty dollars, a hundred dollars, all of that, for that click. I mean, there could be a spam click too. So what we do is analyze queries and bucket them into different categories. Some queries will be high volume and low revenue, some queries will be low volume and high revenue, and it's the final revenue that matters for the company, for the merchant. And it's all automated; we do nothing which is basically
manual. The algorithm takes care of identifying these different types of queries and optimizing for them.

And on your first question, in terms of query understanding: a lot of the ranking depends on the performance of products. Let's say we have the ability to do query understanding and take the brand out, so we can make it an optional parameter. When we show both of these kinds of products, we can pretty much learn which products are getting bought and which products are not getting bought. With that we can come to a conclusion. For example, we have a similar query, which is red valentino shoes. Red Valentino is a brand, and Valentino is also a brand. So does the user want Valentino shoes that are red, or do they want the shoes from the brand Red Valentino? We've been able to resolve those ambiguities in our ranking. Does that answer your question?