Hello everyone, good morning, and welcome to The Fifth Elephant. I hope you didn't have to fight through too much traffic to get here; it looks like some people are still doing that, but I hope they'll be here soon. How many of you are attending The Fifth Elephant for the first time? Oh, okay, that's quite a few. I'd love to tell you, in case you don't already know, why this conference is called The Fifth Elephant. Do you know? Are you curious to know? Okay. I'm Shreyas, and I'm going to be your emcee for today. The name is borrowed from a book by Terry Pratchett called The Fifth Elephant, part of the Discworld series. In that fantasy world, the world is essentially a flat disc which rests on the backs of four elephants. The legend is that there was a fifth elephant at the genesis of the world which sank into the world and became oil, and that fifth elephant has been missing for a really long time. Since data is the new oil, that is why this conference is called The Fifth Elephant.

Our first talk is by Akash from Flipkart. Just to give some context: an experience I've had, and that probably a lot of you have had, is when you buy something from Amazon or Flipkart. It certainly happened to me on Amazon recently. I bought an iron box and I was recommended an iron box again, and I was like, why? Why are you doing this? I'm really interested in the nuances of why, when you buy something, the same thing gets recommended, and what the nuances are of recommending complements for a product, because I'm sure there are smart people working behind it and there are reasons why that happens. Akash will talk about that a lot more in this talk. So, let's hear it for Akash.

Hello everyone, good morning. I wish to share an interesting story which happened to me a few weeks back. I was walking along the streets of Koramangala, just pub hopping, late on a weekend, when I got a call from my manager. It was a very odd hour for a call and quite unusual, so reluctantly I picked it up. My manager on the other side said, "Akash, looks like something's wrong. One Indian Girl is all over recommendations." Now, being a bachelor, I was curious who this girl was. Anyway, I asked him for more details, like which girl are you talking about, and he replied that there was this new release by Chetan Bhagat, One Indian Girl, which was all over the recommendation widgets on our website. So I immediately rushed back home, booted up my laptop and went browsing. I picked up the book GRE for Dummies, and as my manager was saying, One Indian Girl was surfacing somewhere in the middle, looking out of place. Then I went to a book by Shashi Tharoor, and again One Indian Girl was there as well. I was very confused about what had gone wrong and I started debugging, but then it struck me: there was a marketing campaign which Flipkart had launched, selling this book for, say, one rupee when bought along with other books. So probably that was one of the reasons why it was appearing all over where it shouldn't have. Clearly, this was an example of our ranking gone wrong.
In this talk, we will look at some of the relevance and ranking aspects: how we do relevance, how we do ranking, and what might have potentially caused such an error.

Before that, let me first tell you what recommendations are. Say you go to an offline salesman; you go to a Nike showroom and ask the salesman for running shoes of a certain size. You specify your need in terms of a certain price, maybe a certain colour, maybe your purpose, say running, and he comes back to you with a bunch of suggestions. He doesn't just use your direct query; he applies context as well. He checks what season it is, winter or summer, or whether it's raining, and that helps him prune things down. He might also look at the build of your body and your shoe size before getting back to you with suggestions. In the online world, this recommendation widget is solving that in some respect. We have some context about the user, maybe previous purchases or browsing history, and on top of that we know the user has landed on, say, a Puma shoe, and we want to recommend a set of similar products. Here again there are size and colour variations that we offer, and different suggestions based on a bunch of criteria, some of which we will cover as part of today's talk.

I'm going to structure my talk like this. First I'll cover the relevance bits: what relevance is, and the different techniques towards relevance, content-based and collaborative. Then I'll go into ranking: how we can rank products optimally given the context, how we create an architecture that allows us quick experimentation, and what features we use for ranking. Finally, I wish to leave some time for Q&A.

Let's first jump into what relevance is. Given a context (in the previous slide, the context was the shoe in question), is the suggestion relevant? We want to ask such a Boolean question of the recommender system: does the candidate we are recommending cross a certain relevancy threshold? So how do we define such a threshold? Here is a graph showing the precision requirement versus the nature of the context. Take the homepage of Flipkart: the user has not typed in any query, he has just visited Flipkart, so the nature of the context is quite implicit. We might have some history, but he has not explicitly said what he is looking for. On the other extreme, we have the very explicit context of the search page, where the user has typed an actual search query, and in that case we have to operate at very high precision: if the query was "Nike sport shoes", we have to stay within those limits and can't deviate much from the required suggestions. The product page, the details page which we saw, is somewhere in the middle. The nature of the context is moderate, where we are not as lenient as the homepage and not as strict as the search page, so the precision requirement will also be moderate, and that helps us decide what threshold of precision we want to play in. There are a bunch of techniques, broadly divided into content-based and collaborative techniques, which help us figure out relevance. Obviously, I don't want to recommend a shoe against something like a mobile; that is totally irrelevant.
But within shoes, how do you define relevance? Let's get into the collaborative aspect. First, let me tell you what collaborative filtering is. On the left side you see a category hierarchy. This can be any commerce hierarchy we might have: for example, you can start with footwear at the top; below footwear you could have shoes and sandals; within shoes you could have casual shoes and formal shoes. It's basically a taxonomy that helps the user browse across Flipkart. Then we have user activity: the user comes to Flipkart, browses the website, maybe adds to cart or adds to wishlist or performs some other activity, finally leading to a purchase. All of that comes under the umbrella of user activity. Once we have this category hierarchy and the user activity, we want to perform collaborative filtering to get a set of relevant suggestions. So let's see what is inside this block of collaborative filtering.

Here you see three books: Introduction to Algorithms, Computer Networks, and Five Point Someone. The Venn diagram circles here are a proxy for the volume of visits to each book, and the overlap represents the extent of co-visits. It's quite intuitive that the greater the overlap, the more related those suggestions are; for example, for Algorithms, Networks might be nearer than Five Point Someone. However, there is one nuance: popular products might end up appearing in most of the suggestions. Something like Five Point Someone, which is a top seller, will creep into academic books, fiction, non-fiction, almost everywhere on the website. So we have to discount the overlap by the popularity of the product. Cosine similarity is one such measure which is widely used, where the intersection count is normalized by the popularity of the individual products. The cosine gives you the extent of overlap and tells you whether these suggestions are really related or not.

Here we saw three individual products where we mined some kind of pattern: we said that Algorithms is nearer to Computer Networks than to Five Point Someone. But it need not be at the individual product level. Take this example for the case of sarees. The slide might not be fully visible, so let me read it out aloud. In the centre you have a fabric type of saree called synthetic georgette. There are around 150 or 200 fabric types of sarees that we sell, which basically help the user narrow down the decision. Now, using collaborative filtering and user activity data, we can mine relationships like: pure georgette is related to synthetic georgette, so is art silk, so is chiffon. This gives us a kind of attribute graph which we can work on top of, and that extends the product-to-product similarity we saw on the last slide.
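As a minimal sketch of the item-to-item cosine similarity just described (the co-visit data structure and numbers here are invented for illustration, not Flipkart's actual code):

```python
from math import sqrt

# Illustrative co-visit data: product -> set of user ids who viewed it.
visits = {
    "intro_to_algorithms": {"u1", "u2", "u3", "u4"},
    "computer_networks":   {"u2", "u3", "u4", "u5"},
    "five_point_someone":  {"u1", "u2", "u5", "u6", "u7", "u8", "u9"},
}

def cosine_similarity(item_a, item_b):
    """Overlap of visitor sets, normalized by each item's popularity."""
    users_a, users_b = visits[item_a], visits[item_b]
    overlap = len(users_a & users_b)
    return overlap / (sqrt(len(users_a)) * sqrt(len(users_b)))

# A popular item like Five Point Someone overlaps with everything,
# but the normalization keeps it from dominating the similarity scores.
print(cosine_similarity("intro_to_algorithms", "computer_networks"))
print(cosine_similarity("intro_to_algorithms", "five_point_someone"))
```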
The example you just saw is constrained to a single category: we had sarees as a category, and within sarees we were able to relate a few attributes. But there is no reason to be restricted to a single category. For example, during the recently concluded World Cup, one of the patterns that we saw was a Ronaldo theme surfacing across diverse kinds of recommendations. This laptop decal, for example, is from the electronics accessories category. If we do a category cut, we won't be able to surface such a pattern, but since user activity can span categories, it can be very broad, and we got such patterns. There were wallets and t-shirts from the fashion category showing up under the electronics category. If we apply rules like "fashion should be shown only within the context of fashion", we will miss out on such patterns, and hence it is important to balance the category mix with these kinds of patterns which come directly from user activity.

So far we saw how user activity can be used for mining patterns. Another important source for deriving recommendations is the product attributes themselves. Here we start with product attributes, apply some kind of attribute similarity, and we get a set of relevant products. Let's get into what attribute similarity is. There are two TVs here, a Vu TV and a Kodak TV. They have a bunch of attributes defined; some are visible here, like price, offers, some kind of warranty, and screen size, and some are not visible to the users. I have listed some of these attributes: screen size, brand, whether it's 3D or not, HDMI ports, and so on. Some of those attributes will match and some won't, and the more attributes that match, the more similar the two products are. Now there are certain nuances with attributes: they might not be complete, so we have to decide what to do in the case of incomplete attributes. Also, all these attributes are not equal. Let me take a quick show of hands: we have screen size and brand, so how many of you think screen size is more important for making a purchase decision on a TV? Okay, and how many think brand is more important? Yeah, I see a very divided audience. So we don't rely on intuition to make this judgement; one of the ways to derive the relative importance of these attributes is to look at the data itself. We have filters present for screen size and brand on any TV kind of query, and the number of people clicking the screen size filter is a good proxy for the real importance of screen size; similarly, the number of people clicking on the brand filter will tell us whether brand is more important than screen size. That is one way data can be used to set the relative importance of these features. Another important signal is the search queries themselves: if people are typing more "42 inch TV" kind of queries, then 42 inch is a more important criterion in the mind of the user while making a TV purchase.
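A minimal sketch of what such an attribute match might look like, assuming attribute weights have already been derived from filter-click proportions (the weights and attribute names below are made up for illustration):

```python
# Hypothetical weights, e.g. derived from the share of filter clicks per attribute.
ATTRIBUTE_WEIGHTS = {"screen_size": 0.35, "brand": 0.30, "is_3d": 0.10, "hdmi_ports": 0.25}

def attribute_similarity(product_a, product_b):
    """Weighted fraction of matching attributes; missing attributes contribute nothing."""
    score, total_weight = 0.0, 0.0
    for attr, weight in ATTRIBUTE_WEIGHTS.items():
        a, b = product_a.get(attr), product_b.get(attr)
        if a is None or b is None:      # incomplete catalogue data: skip the attribute
            continue
        total_weight += weight
        if a == b:
            score += weight
    return score / total_weight if total_weight else 0.0

vu_tv = {"screen_size": 42, "brand": "Vu", "is_3d": False, "hdmi_ports": 2}
kodak_tv = {"screen_size": 42, "brand": "Kodak", "hdmi_ports": 2}  # is_3d missing
print(attribute_similarity(vu_tv, kodak_tv))  # ~0.67 under these made-up weights
```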
Here we saw the structured catalogue attributes. Another signal which can be used is the image itself. We start from the image of the product, apply some kind of visual similarity technique, and again we get a set of visually similar recommendations. Here we have a saree, and our objective is to show visually similar sarees. The results can span brands and different price ranges, but we want them to be close in terms of looks. So we train a neural network to answer such queries. We give it a query image, a positive image and a negative image, and the objective is to bring the query and the positive image as close to each other as possible and push the negative image further apart. Basically, this neural network tries to learn a lower-dimensional representation: starting from the raw pixels, it learns a representation of each image such that the query and the positive image are closer in that lower-dimensional space and the negatives are further apart. Then, for this saree, we can search for nearest neighbours in that space as a proxy for similarity.

One other observation we had while dealing with such data: since we are a marketplace, there are a bunch of sellers which sell similar kinds of products, and when we first put this model into production, we had near duplicates coming up; in fact, the exact same saree started surfacing in the recommendations. That was given as feedback to the catalogue team: these are near duplicates, they need not be different items, they can be clubbed into a single item with multiple sellers, and that leads to a better customer experience rather than showing the exact same item again and again.

So far we have seen how user activity and product attributes can be used to compute a similarity metric between a pair of products. Initially we had different widgets for each of these: something like "customers who bought this also bought" based on collaborative filtering, a module for attribute matching, a module for visual similarity. We could afford different modules on desktop, but recently we saw an interesting trend: with Jio and so on coming up, mobile traffic started to dominate, and at this point the majority of our traffic comes from mobile, whether the mobile website or Android. We might not have space for a separate visual widget, a separate collaborative widget and so on, so for the user we might want to combine these widgets. So how do we combine them? We had these relevance criteria: collaborative filtering, attribute similarity and visual similarity. How do we combine them into a single ranked suggestion list?

Let's see what we have at our disposal. First, we have user clicks. We show a bunch of things to the user and we get click feedback, and if the user is clicking more and more, that tells us about engagement. One module might lead to more engagement than another, so that could be one criterion to judge which is performing better. One problem with clicks is what we saw at the start of the presentation: the one rupee book was getting a lot of clicks, and that probably led to that Chetan Bhagat book surfacing all over the place, because people were browsing it a lot out of curiosity but probably not purchasing it. So another important thing that comes into the picture is: why not use conversion itself, not the engagement but the conversion?
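Before moving on, here is a rough sketch of the triplet objective described a little earlier for the visual model. This is a generic PyTorch-style illustration under my own assumptions about the architecture; it is not Flipkart's actual network:

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Maps a raw image to a low-dimensional embedding (toy architecture)."""
    def __init__(self, dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, dim)

    def forward(self, x):
        h = self.features(x).flatten(1)
        z = self.fc(h)
        return nn.functional.normalize(z, dim=1)  # unit-length embeddings

net = EmbeddingNet()
loss_fn = nn.TripletMarginLoss(margin=0.2)

# One toy training step on random "images" standing in for query / positive / negative sarees.
query, positive, negative = (torch.randn(8, 3, 64, 64) for _ in range(3))
loss = loss_fn(net(query), net(positive), net(negative))
loss.backward()  # pulls query and positive together, pushes the negative apart
```

At serving time, nearest-neighbour search in this embedding space is what stands in for "visually similar".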
It's important to specify what you want to rank for. You will combine these signals, but how you combine them depends on what you want to optimize: conversion, or user engagement. Those are the typical goals of a recommendation system, and there's a trade-off between engagement and conversion. Also important are diversity and serendipity: users expect the recommendation engine to show some surprising content. One example we saw was the Ronaldo-based pattern, where people who might be fans of Ronaldo were browsing products across categories with a unifying theme.

We use learning to rank for combining these different relevance signals. It's a machine-learned model that generates the ranked list. Basically, we have a set of items to recommend, obtained from any of the relevance sources, collaborative or content-based, and we have product conversion data. The positives are much fewer, because out of all the people that come to Flipkart, many don't end up purchasing; a fraction of the people purchase and those are the positives, and there are a lot of negatives in the product conversion data. We train our model on that and pass candidate items to the model, which scores each of them. So in essence, we are taking those click and purchase signals and passing them through a feedback loop back into the ranking block.

Let's see what this feedback loop looks like at Flipkart. We started with a linear, logistic regression based model. One important thing was to ensure production sanity: once we deployed this model, there were a lot of gaps in the incoming data. We rely on a lot of data coming in, click data, conversion data and the relevance signals, and some of it might get delayed or might not be up to date, so how we measure the performance of such a model becomes an important consideration. As for the scale at which we work, these are rough numbers: out of around 100 million odd products, we are confidently able to recommend for a subset, around 15 to 20 percent of products, where we can say these are the relevant products, now rank among them. For training data, we take a one-month window of conversions which happened from the recommendation widgets, on the order of a billion training points. Then there is scoring, where pairs of products are scored: for a pair, we will have a score from the collaborative filtering engine, a score from the attribute matching engine, and another score from the visual similarity engine, and they come together to be scored by this ranking module.
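To make that learning-to-rank step concrete, here is a minimal sketch in scikit-learn terms; the feature values and data are invented for illustration, and the real system obviously uses far richer features and far more data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a (context product, candidate product) pair.
# Columns: collaborative score, attribute-match score, visual-similarity score.
X = np.array([
    [0.9, 0.7, 0.4],   # pair that led to a purchase
    [0.2, 0.1, 0.3],   # shown, but no purchase
    [0.8, 0.6, 0.9],
    [0.1, 0.4, 0.2],
])
y = np.array([1, 0, 1, 0])  # 1 = conversion (positive), 0 = no conversion (negative)

model = LogisticRegression()
model.fit(X, y)

# Rank new candidates for a context product by predicted conversion probability.
candidates = np.array([[0.7, 0.5, 0.8], [0.3, 0.9, 0.1]])
scores = model.predict_proba(candidates)[:, 1]
print(np.argsort(-scores), scores)
# Because the model is linear, model.coef_ also exposes each signal's learned weight,
# which is what keeps the ranker easy to debug.
```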
One thing we experienced during the Big Billion Days: we got a lot of traffic and our models started behaving quite abnormally, and we were curious why this was happening. Apparently, the traffic that comes during Big Billion Days is a very different kind of traffic; those users are looking for deals, and the way we were optimizing for conversion, we were not taking these deals as a factor in our recommendations. Initially we treated this data as an anomaly: whenever there was a spike sale on our platform, we just ignored that data. Another way to deal with this is to adjust the training data with the sale day as a feature. There are certain days in the calendar which are marked for us, Diwali sales or New Year sales for example; you can take those as a feature and rank specifically for sale kinds of scenarios.

So far we saw how the relevance signals help us rank the overall set of products. Let's see what other features can be useful to us. One set of features which is important here is quality features. What do I mean by quality? Every product we sell on Flipkart has a notion of quality, even in the mind of whoever is purchasing. We have rating data: people rate products. Then there are people writing reviews, and reviews are useful in that we can mine information from them: is it a positive review or a negative review? A positive review tells us that people like this product, so it's good quality. The hypothesis here is that the higher the quality, the higher the conversion.

We started with human-labelled data: there was a category team which supplied us with a set of good products, and initially it performed quite well, but over time that data became stale. The category managers can't keep tagging; they initially gave a set of one to two lakh FSNs, products which were human-labelled as good, but that set is very hard to refresh. So we moved to the crowd-sourced version, which is the ratings, reviews and the return rate. Return rate is another aspect: if a product is getting returned a lot, that seller or that class of product might not be suitable, because people returning it a lot is a huge cost on the marketplace.

These quality features have a sparsity problem, because we are explicitly asking the user to give a rating, so only a small fraction of our products actually have ratings and reviews. In that case it's useful to fall back to the next higher level of aggregation. The category hierarchy we saw can be used here as well: it can feed into the model, and we can average by seller or average by brand. Say there is the Bose brand, for example; relatively speaking, Bose will have a much lower return rate than something like Skullcandy, so we can basically penalize or boost at the level of the entire brand. Say there is a new Skullcandy earphone and we are comparing it with a Bose earphone: probably people like Bose kind of earphones more, so they will have higher quality, and quality naturally has an impact on the conversion we are trying to predict.
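A minimal sketch of that fallback idea for sparse quality signals; the threshold and the brand and category averages here are assumptions for illustration, not Flipkart's actual values:

```python
MIN_RATINGS = 20  # below this, the product-level average is considered too noisy

def quality_score(product, brand_avg_rating, category_avg_rating):
    """Use the product's own rating when there is enough data; otherwise fall
    back to a brand-level aggregate, and then to a category-level aggregate."""
    if product.get("num_ratings", 0) >= MIN_RATINGS:
        return product["avg_rating"]
    if brand_avg_rating is not None:
        return brand_avg_rating
    return category_avg_rating

new_earphone = {"brand": "Skullcandy", "num_ratings": 3, "avg_rating": 4.7}
# Too few ratings of its own, so it inherits the (hypothetical) brand-level average.
print(quality_score(new_earphone, brand_avg_rating=3.6, category_avg_rating=4.0))
```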
Another set of features that we found useful was historical features. All of us here are attending Fifth Elephant, and it offers a very good example of historical features: all of you might have seen those feedback forms floating around. Do you know why those feedback forms are useful? What Fifth Elephant does with those forms is collect the speaker feedback and use it for the next edition of Fifth Elephant, so there is a feedback loop built right into it. Similarly, in the e-commerce space, historical features give us that kind of feedback loop. One thing about historical features is that they get stale very fast: there are new product additions, prices change, offers change, and hence we need to refresh them very quickly. Initially we were not refreshing them; as soon as we started refreshing them with every index update, it gave a good conversion boost. That's why historical features are very powerful. A historical feature would be something like: products from the HP brand are performing very well, in contrast to something like the Acer brand which is not performing well, so historically speaking, if there are good-performing products, we want to rank them above the lower-performing ones.

One other interesting nuance was about presentation. We have different presentation channels, the desktop website and Android, and the nature of the interface makes them very different. There are devices of many different sizes in the Android market, and all these presentation aspects matter a lot while making the ranking prediction. On the left you see a screenshot of Android, which is a 2x2 grid, so your entire focus is on that widget. Next to it is a screenshot from the desktop website, where you have one widget, but in the overall scheme of the page it is just one of the rows, so the user's focus might be in multiple places on the desktop website.

So what do we call an impression? An impression is basically when the user actually sees your module, and impressions are quite useful to measure the performance of the widget. In the case of Android, all four of those items are most likely seen by the user; on the other hand, the six items in the desktop setting might not have been seen, because the user's attention might be divided. We need to factor in all those scenarios while accounting for performance. Android will naturally get more clicks because of this behaviour; it tends to get much higher clicks because your entire attention is on the device, while on desktop we get far fewer clicks in comparison. So we have to normalize those impressions by the channel.

One other interesting thing we found was the usage of the thumb on Android. You are scrolling on an Android device, and most of us are right-handed, so people tend to click the right-side positions much more; the odd positions are clicked much less than the even positions. This was one of the findings when we analysed the position data for Android. On desktop there is a mouse click, and no such behaviour is observed in the desktop setting. Now, how do we treat that becomes an important question. Our initial thinking was that we would deploy one model in production which works for all channels, but is that really true, given that users are behaving very differently in these channels? Do we deploy two different models for different channels, or do we take the channel as just another feature in one computation? All those trade-offs have to be taken into account while handling presentation.
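A small sketch of the per-channel normalization idea mentioned above, assuming click logs are aggregated by channel and widget position (the numbers are purely illustrative):

```python
from collections import defaultdict

# (channel, position) -> [impressions, clicks], as might be aggregated from logs.
stats = defaultdict(lambda: [0, 0])
log = [
    ("android", 2, True), ("android", 2, False), ("android", 1, False),
    ("desktop", 2, False), ("desktop", 2, False), ("desktop", 1, True),
]
for channel, position, clicked in log:
    stats[(channel, position)][0] += 1
    stats[(channel, position)][1] += int(clicked)

def normalized_ctr(channel, position):
    """CTR of a slot divided by the channel's overall CTR, so that Android's
    naturally higher click rates don't swamp the desktop signal."""
    imp, clk = stats[(channel, position)]
    chan_imp = sum(v[0] for (c, _), v in stats.items() if c == channel)
    chan_clk = sum(v[1] for (c, _), v in stats.items() if c == channel)
    slot_ctr = clk / imp if imp else 0.0
    chan_ctr = chan_clk / chan_imp if chan_imp else 0.0
    return slot_ctr / chan_ctr if chan_ctr else 0.0

print(normalized_ctr("android", 2), normalized_ctr("desktop", 1))
```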
So overall, let's see what we have covered so far. We started from a bunch of input signals: the category hierarchy, user activity data, attributes and images. We applied a bunch of relevance techniques, and we also fed the quality, historical and presentation features into a ranking module, which we optimized on clicks or conversions; clicks or conversions are what we used as the optimization target, and that gives the final picture of the ranking block. These things were developed over multiple iterations. We started initially with a few of those blocks; some took a multi-month effort, some took years of effort to build. I have tried to colour-code this on the slide: the darker blocks represent more time, up to years of effort, while the lighter ones were things we were able to achieve relatively faster. Historical features, for example, look very simple, but getting them right and analysing them at the right granularity of category is quite important.

About the pipeline: there are a bunch of pipelines running here. There is a user-activity-to-collaborative-filtering pipeline, an attribute similarity pipeline and a visual similarity pipeline. Most of these are scheduled once a day in our case, so each of them gets refreshed once a day, and those features are prepared from the user activity data itself. History is again prepared from the user's previous data; presentation is prepared from the feedback coming from the Android and desktop channels; quality features are prepared from the ratings and reviews data sets. All of those get refreshed daily, and the training and scoring also happen once a day. We have scheduled jobs which start from the raw data sources and compute everything, and every day a new index gets pushed to production.

So that's it. These are the references I have used in this presentation, and I'll be happy to take some questions.

Off-the-charts demand, guys. Can we have the people sitting on the stairs move to one side so that the mic runners can get through?

My question is about studying the behavioural pattern of a customer based on clickstream data, that is, real-time analytics. Suppose the customer is new and is clicking on your products; he may buy or not buy. Suppose he buys and chooses cash on delivery: what are the chances that the customer will return the product or not, and what is Flipkart's success or failure rate? And if the customer does not buy, how is the classification done?

Okay, so you asked about the usual success rates: I can't actually disclose the raw numbers. On the behavioural pattern of a customer based on clickstream data: on the right here we had the click and conversion data, and the previous data is analysed there. What we try to do is feed the last 30 days of data back into the models, so we already have the conversion rate and the click rate. Say some module has a 10% click rate: that 10% will be treated as positives and the other 90% as negatives.

No, I'm not talking about historical data, I'm talking about real-time analytics.

In this case, what we are doing is limiting ourselves to historical data; we just take the previous few days' window, which is sufficient for us to make a recommendation, because the patterns don't change that much within a day. What we have figured out is that the cost of doing real time is much more than the cost of doing batch, and in our case batch was the right fit, so we relied only on the historical data.
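A minimal sketch of how such training labels might be carved out of the logged data, using a 30-day window with conversions (or clicks, depending on the objective) as positives and everything else as negatives; the log format here is invented for illustration:

```python
from datetime import datetime, timedelta

WINDOW_DAYS = 30

def build_training_examples(events, today):
    """Turn recommendation-widget events from the last 30 days into
    (context_product, candidate_product, label) rows: 1 = converted, 0 = not."""
    cutoff = today - timedelta(days=WINDOW_DAYS)
    rows = []
    for e in events:
        if e["ts"] < cutoff:
            continue                      # older events fall outside the window
        label = 1 if e["converted"] else 0
        rows.append((e["context_product"], e["candidate_product"], label))
    return rows

events = [
    {"ts": datetime(2018, 7, 20), "context_product": "p1", "candidate_product": "p2", "converted": True},
    {"ts": datetime(2018, 7, 21), "context_product": "p1", "candidate_product": "p3", "converted": False},
    {"ts": datetime(2018, 5, 1),  "context_product": "p4", "candidate_product": "p5", "converted": True},
]
print(build_training_examples(events, today=datetime(2018, 7, 28)))
```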
And how is the classification done when the customer does not buy the product?

In that case it's treated as a negative sample: if you buy, it's a positive sample, and if you don't buy, it's a negative. That's how we divide the data.

Hello, thank you for your talk, it was a great one. I have two questions. First, since Flipkart has a large amount of data, how do you schedule your jobs, because collaborative filtering probably takes hours in your case? And second, have you tried supervised machine learning techniques for classification, such as XGBoost or any other technique?

First, how we schedule our jobs: it depends on the data. Downstream jobs depend on the availability of the data upstream. As soon as the ratings data is refreshed, there is a triggered job which notifies the next job that it is ready to process, because the rating data is available. Similarly, the feedback data is collected, and as soon as the feedback data preparation completes, it feeds into the scoring block. For attribute similarity it's just a batch job which runs. Collaborative filtering does take a few hours, so we run it in a distributed fashion over a distributed Hadoop cluster. These jobs are chained to each other: the first job's output is chained as input to the second block, and so on. On the second question: we haven't tried XGBoost and those other techniques; we kept to a simple model which was partly interpretable as well, which helped us with debugging. All the case studies and use cases I talked about were possible because we relied on very simple models which were quite explainable, and we could actually debug the weightage of each of the constituent features.

Good job, nicely presented. Very simple question: do you have an alert mechanism here to identify whether the model is failing?

Yes, and it's very important. There are alerts at each stage, and there are two kinds of alerts that we have. One is volume based: there should not be a certain volume dip in any of these data sources. If you are computing collaborative signals, they are based on user activity aggregates, which can easily deviate from time to time; for example, during Big Billion Days there could be a huge volume coming in, so we have anomaly detection on volume. Then, how do you measure the model itself? This is a ranking model, and one of the metrics we use is AUC, the area under the curve, which we regularly monitor. There are two kinds: one is the train-test AUC, which you measure during the offline modelling and scoring pipeline, and there is an online component as well. Say you have an AUC of 75 during your offline evaluation: does it translate to the online scenario? When the user actually came the next day, how many times were you able to predict the ranking correctly in the online scenario? That tells us the contrast. We have alerts on the AUC even if it dips by a single point, and then we look back at what the model has learned and whether any anomalous data has crept in. Those channel-specific nuances I was talking about were derived from this as well; we had to segment the data by channel to get a complete picture.
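A minimal sketch of this kind of monitoring, assuming labelled outcomes are logged and using scikit-learn's AUC helper; the thresholds are illustrative, not the actual alert settings:

```python
from sklearn.metrics import roc_auc_score

AUC_DROP_ALERT = 0.01     # alert if AUC dips by more than a point
VOLUME_DROP_ALERT = 0.5   # alert if a data source falls below half its usual volume

def check_model_health(y_true, y_scores, baseline_auc, todays_volume, usual_volume):
    """Return a list of alert messages for the daily pipeline run."""
    alerts = []
    if todays_volume < VOLUME_DROP_ALERT * usual_volume:
        alerts.append(f"volume dip: {todays_volume} vs usual {usual_volume}")
    auc = roc_auc_score(y_true, y_scores)
    if auc < baseline_auc - AUC_DROP_ALERT:
        alerts.append(f"AUC dropped to {auc:.3f} from baseline {baseline_auc:.3f}")
    return alerts

# Toy online check: did yesterday's ranking scores predict today's conversions?
y_true = [1, 0, 0, 1, 0, 1]
y_scores = [0.8, 0.3, 0.4, 0.6, 0.7, 0.9]
print(check_model_health(y_true, y_scores, baseline_auc=0.90,
                         todays_volume=40, usual_volume=100))
```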
I just wanted to know how you go about deploying your model, because there would be a large number of requests coming in in real time and the model has to serve those requests. Of course, caching cannot be a great solution, because the results would be different for different users. So how do you scale up the model for serving requests?

There are two kinds of models here. The one I talked about is a product-pivoted model: product-to-product preferences don't vary per request, so they can be cached. Once you rank product-to-product, it holds for the entire day; unless the offers or the prices change a lot, those similarities between a pair of products don't change that much. For the user-side models, we have a distributed service. Heavy users might still be served from cache, but there is a tail of users for which we actually make the model call, and that does happen in real time. We have a hosted, centralized, distributed service which does the prediction at runtime, and we try to keep it very lightweight so that it can be done within the required latency.

Please connect with me offline afterwards; there are other people waiting.

When you gave the saree example, you are going into the feature space and seeing how far apart the items are, right? Which features do you take into account? Let's say it's not a saree but a sunglass: a guy wearing sunglasses, a fair guy and a dark guy. Those skin features you should not take into account when you are comparing sunglasses. So how do you actually filter out which features to use when you are doing that comparison over products in the feature space?

There are two answers to that. In earlier times these used to be handcrafted features, and you had to decide which feature to give importance to: there would be an object detection block where you detect what is the foreground and what is the background, and then you say you will give more focus to the foreground. In our case all of that is not essential. If you look at this query, positive and negative setup: if you supply it enough examples where, say, the query is a white guy wearing sunglasses and the positive is a black guy wearing sunglasses, the network starts to focus on the sunglasses themselves; it learns where to focus. Once you give this neural net enough examples of that kind, it learns these patterns, and you don't need to explicitly code any of these features into the model. In our case the catalogue was quite clean, so we didn't have such composite images; in a real-world setting that is a much harder problem. For us it was a very simple setting: this catalogue image contains only a saree, there is no shoe in focus, and similarly for sunglasses the image is focused only on the sunglasses. We had the luxury of having that clean data set already; otherwise you would need much more preprocessing.
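A rough sketch of the serving split described a little earlier, with invented function names just to illustrate the cached product-to-product path versus the real-time user-side call:

```python
# Product-to-product recommendations are precomputed daily and can be cached.
P2P_CACHE = {"puma_shoe_123": ["nike_shoe_456", "adidas_shoe_789"]}

# User-side scores, cached only for heavy users; the tail goes to the live service.
USER_CACHE = {"heavy_user_1": {"nike_shoe_456": 0.9, "adidas_shoe_789": 0.4}}

def call_ranking_service(user_id, candidates):
    """Stand-in for the lightweight, distributed real-time model call."""
    return {c: 0.5 for c in candidates}  # dummy scores

def recommend(user_id, product_id, k=2):
    candidates = P2P_CACHE.get(product_id, [])               # cheap cache lookup
    scores = USER_CACHE.get(user_id)                          # heavy users: cached
    if scores is None:
        scores = call_ranking_service(user_id, candidates)    # tail users: live call
    return sorted(candidates, key=lambda c: scores.get(c, 0.0), reverse=True)[:k]

print(recommend("heavy_user_1", "puma_shoe_123"))
print(recommend("new_user_42", "puma_shoe_123"))
```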
Do people have more questions? Okay, so this is a pretty awesome talk, because we have more questions than we did before. Before we let you have those questions, our next speaker is Uma, who is going to be talking after the chai break. I'd like to ask her to plug why you all should come back here for her talk.

Hi all, my name is Uma. I currently work at LinkedIn, but as part of my PhD thesis I worked on something called entity search, whose aim is to really upgrade your search experience. All of you do use web search engines for your day-to-day work, right? So I will try to give you an awareness of the way in which entity search hopes to upgrade your search experience, and if you are a search practitioner, I hope you can take back some insights to improve your search. Hoping to see you at my talk, thank you.

Thanks, Uma. So, Akash, do you want to go forward with the questions? People who want to go for the chai break can go, but people who want to stick around are welcome to. Sorry, I just want to add: I know asking questions takes a bit of going outside your comfort zone, it definitely does for me, so in case you are not feeling too comfortable taking the mic and asking a question, feel free to put it on Twitter and I will try to read it out here on your behalf, which is completely cool.

Thank you for the talk. You mentioned that across channels you sometimes need to decide whether to use the channel as a feature or a different model entirely; you mentioned the right-hand thumb case as well. What are the factors that you usually consider, apart from the cost-benefit analysis, say the technical analysis you conduct, to use the channel as a feature or as a different model entirely?

Like you said, the first answer is the cost-benefit analysis. Running two parallel models and ensuring sanity for both is hard; even to run this single model takes a lot of effort, it took multiple quarters, almost a year, to reach this state just for the ranking model itself, so maintaining two models is quite hard. I would always go with having a simpler model. In our case we started with a linear model, so we might want to have crossing kinds of features: the presentation features crossed with the channel could give us an equivalent substitute for having a separate model. What we can do is prototype what performance the two individual models give: is it far better than keeping a single model, or a single multi-layered model with some kind of crossing? That is the trade-off, basically; crossing is one of the answers which we are evaluating as well.

Quick question: I don't see the user's buying behaviour being modelled here. Every user probably has a different buying behaviour, and it also varies based on, you know, when you receive your salary or whatnot.
So in the scope of this work, as you correctly pointed out, this is product-pivoted: most of this is product-to-product relevance, which has to be overlaid with a user-centric model. We have a parallel, different model which uses this as its backbone; this forms the backbone of our recommender system, and then user signals are overlaid on top. One of the things you mentioned was a user receiving a salary at the start of the month: should we recommend more expensive products at the start of the month and more value products at the end of the month? Absolutely, but that model sits on top of this one. We have user-to-product, and, going back to the category hierarchy, we have user-to-attribute as well: which attributes you like, and within those attributes which selection you will go for, is a second level of detail which can be overlaid on top of these two. In our case we have kept user-to-product and product-to-product quite separate, so that they are more interpretable, and we didn't combine the user-to-product prediction in a direct manner. There are alternative approaches where you do a factorization and compute a single embedding for the user and the product in a combined space, but in our case we preferred to keep them separate.

Thanks for the informative talk. My question is specifically on collaborative filtering: it suffers from the cold-start problem. There could be products which are newly added to your catalogue which might show up in your ranking for some reason. How do you deal with that?

Here we saw that there is a feedback loop, and that comes in handy. Say for some selection we started showing a newly added product: due to the feedback loop, it will get penalized if it performs poorly. We have this click and conversion data, and if nobody clicks on the new selection, then since this thing is refreshed daily, at most it gets highlighted during one day, and the next day the content which performs poorly is automatically penalized. That feedback loop comes in very handy.

Okay, so what you are really asking is, how do you make a new product surface at all? There are various techniques available for that, and one of them is called explore/exploit. What we are doing in the approach I described is exploiting: we just show whatever is at the top of the list. The usual practice is to leave some kind of bucket, say 20 percent, for exploration. For the exploration you randomly choose some of your selection from the tail, say beyond the top hundred, and you start showing it to 20 percent of users. Again, due to the feedback loop, if it starts performing, it will start surfacing up. So explore/exploit is another technique used to bring in newer selection. It's still a problem because we have a long tail of products which have no activity at all; this is an ongoing problem we keep attacking, and explore/exploit is one of the techniques we have found useful for it.
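A minimal sketch of that explore/exploit split; the 20 percent exploration bucket follows the talk, while the function names and candidate lists are illustrative:

```python
import random

EXPLORE_FRACTION = 0.2  # roughly the exploration bucket mentioned in the talk

def pick_recommendations(ranked_list, k=4, explore_fraction=EXPLORE_FRACTION):
    """Mostly exploit the top of the ranked list, but reserve some slots for
    random items from the tail so new products get a chance to surface."""
    n_explore = max(1, int(round(k * explore_fraction)))
    n_exploit = k - n_explore
    head, tail = ranked_list[:100], ranked_list[100:]
    picks = head[:n_exploit]
    picks += random.sample(tail, min(n_explore, len(tail)))
    return picks

ranked = [f"item_{i}" for i in range(500)]  # item_0 best known, item_499 newest/least known
print(pick_recommendations(ranked))
```

The daily feedback loop then decides whether an explored item earns its way into the exploited head of the list.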
Thanks, Akash. My question goes back to the same thing we have talked about two or three times already. You have a user-to-product mapping and a product-to-product mapping, and product-to-product is what you're caching for faster serving, which makes sense. The question is whether the user-to-product mapping is lost in all of this process when you come to the final recommendation. I'll give an example: I like hiking and I browse a lot of hiking products; you like fashion and browse a lot of fashion products. Both of us search for electronic cameras. Is the recommendation system more likely to show me a GoPro and show you a DSLR? Does that come into play: is the user feature retained, or is it lost over this whole process?

That's a good question. If you deal only with product-to-product, that theme will get lost, like you rightly pointed out. But here we saw something regarding latent behaviours: you can try to capture latent concepts, travel as a theme, hiking as a theme. These are not dedicated categories on Flipkart; they are themes which unite products, like the football, Ronaldo, CR7 kind of theme. If you can identify such themes running across products, one way to do that is the factorization I was mentioning, where you represent each of these products in, say, a 20 or 30 dimensional lower-dimensional space, and when you compress into that space you are more likely to find such patterns. But there is a trade-off with interpretability: you might get patterns which you are not able to understand; the model says it's correct, but you might not be able to understand it, and hence there could be false positives as well, so we have to be careful.

Hey, thanks for the talk. You talked about the model that ranks the products. From whatever you spoke about, we have a lot of product-level data: the image mapping data, the collaborative filtering data. But what does it actually rank on, taking all that data? If it's supervised learning, what is the target?

On the right you see the two metrics we covered: one is customer engagement, whether the user is clicking a lot, and the other is conversion, whether this recommendation led to a conversion. It's very important to specify which you want to optimize for. At the start of the process, whenever we are building the block, we first have to identify that criterion, like you mentioned. Basically, it's your product manager, along with you, who helps decide: I want to optimize for clicks and I don't care about people purchasing, or I care about people purchasing and I don't need them to stick around on my site that much, or it could be a balance. That balance helps us choose the trade-off; as data scientists we are equal partners, so yes, you have a say. In our case it's a combined process, along with product and business.

Sir, follow up offline please, there are other people waiting.

Thank you, Akash. My question is actually related to your team structure. One, I gather it has taken you a long time to build this entire recommendation system: have you ever considered a build-versus-buy decision, and what led you to the build decision? Secondly, how is your team structured to ensure sustenance of this model on an ongoing basis?

Maybe one question, and the second one you can take offline; there are a lot of people.

To answer the first question, about build versus buy:
At Flipkart, most of the things we use rely on open source components, and most things are built in-house, because when you buy something there is a big loop involved for any corrections. The nuances you saw are very hard to capture in an external, outsourced scenario. For example, for the position findings, I had to sit with the Android team to understand: is it really true that your clicks are happening on the right side? On desktop, where are your clicks, and what counts as an impression? All of that is best done sitting together; even in an outsourced setting it's strongly recommended that those teams sit with you and do a combined effort, rather than just focusing on the model part instead of looking at the data and the distribution of the features. For our case we mostly preferred building simpler stuff rather than off-the-shelf, pre-built full solutions, because they won't understand such nuances.

Hi Akash. Based on this recommender system, once a conversion has taken place, how does the recommendation change? Let's say I have been looking for a mobile and one day I buy the mobile; the next time I log into Flipkart, what really happens behind the scenes? Because it still keeps showing me a lot of mobiles.

There are multiple things in the picture here. One, this is the product-to-product world, and you are talking about the user world which overlays on top of it. There are a lot of reasons why, as you say, a mobile can resurface. Here I said that the optimization is on clicks and conversion, and there could be an incentive for us, because people don't purchase just one mobile for themselves: we have seen people purchase a mobile and then, a week later, purchase two more mobiles, maybe for their family. So along with clicks and conversion there might be business considerations as well which lead to that. Also there could be delays: you saw the pipeline, and all of this happens in batch, so there could be delays in the incoming data leading to that. Another thing, which we were discussing before the talk, is that the mobile might be placed under, say, the tablets category; that miscategorization does happen, and in that case we are actually recommending tablets and not mobiles, system-wise. All those kinds of scenarios might lead to this abnormal kind of suggestion, but there might also be a case for actually showing such a suggestion, because finally users do purchase it again; there are categories which do see repurchases.

Thanks, Akash, tremendous response to your talk, and I think we should close up now. You can catch Akash offline; he'll be available. We need to set up the auditorium for the next talk, so sorry for those of you who had more questions. Thank you.

Some insights that came out of those. So the first question is: what are these entities I am talking about? Simply put, entities are objects of interest. Those could be people, such as Tendulkar; they could be locations, movies, monuments, phones, cars, universities and so on, any object of interest. Entity search is a search engine that takes in a text query and presents results in terms of these objects.
For example, if my query is "Brad Pitt movies", I don't want to see documents, I want to see a carousel of movies. If my query is "universities India", again I expect to see a carousel of colleges and universities. So in this case my entities are movies and universities. Research shows that at least a quarter of today's web search queries are actually seeking entities, so if we can do better entity search, we can really upgrade the search experience.

What do these queries look like? Some examples are "Bahubali lead roles", "fastest ODI double century", "main oncology universities India", and "which phones have unsafe batteries". Now, at this point somebody may ask: aren't today's search engines already doing some of this? Yes, they are, but there are two main problems as we see it today. First is the lack of coverage: although, as I said, at least a quarter of web search queries are seeking entities, they are still being answered in terms of documents. If I query for "main oncology universities India", I get documents; if I query for "which phones have unsafe batteries", again documents. The second main problem we see today is lack of robustness: for very similar formulations of a query, I see entity results in one case and document results in the other. If I query for "murder books Agatha Christie", I get a nice carousel of books, which is what I want; but if I query for "murder mysteries Agatha Christie", I fall back to documents. And if you look at the actual results, the actual entities I am looking for are mentioned in the top documents, yet they are not extracted out.

So one question is why this happens. One main reason these kinds of problems come up is that today's search engines are dependent on what is known as the knowledge graph for doing question answering. If you attended Professor Partha's talk, you would have heard something about knowledge graphs. What are knowledge graphs? Very simply, they are collections of facts about entities: they contain entities, types and relations, presented in the format of a graph. Here is a very simple example knowledge graph. It has two entities, Prabhas and Bahubali; these entities are of type actor and film, and they are connected via the relation "acted in". To give you some examples of public and proprietary knowledge graphs: Wikipedia is a kind of knowledge graph; DBpedia, which is derived from Wikipedia, is another; then there are Freebase, YAGO, Wikidata, and NELL, the Never-Ending Language Learning project that Professor Partha mentioned yesterday. These are all public knowledge graphs that you can use. In terms of proprietary knowledge graphs, LinkedIn has its own knowledge graph of people, their skills and their jobs; Google has a very nice knowledge graph; Microsoft has Satori; Facebook has its own graph of people and their connections; and so on.

So let's get back to why today's search engines do not perform well when it comes to entity search. As I said, most of them depend on these knowledge graphs for question answering. However, to answer questions based on a knowledge graph, you have to translate the text query, such as "Bahubali lead roles", into a query which can be executed on the graph; say, a query like "select an entity E where E is connected to the entity Bahubali via the relation acted in". This approach works best when your translation is right: you get fast and precise answers. The bad part about this approach is that it is not always easy to come up with the right translation for your input query.
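As a toy illustration of that translation step (a hand-rolled graph and query function purely for illustration; real systems would use a graph store and a structured query language):

```python
# A tiny knowledge graph as (subject, relation, object) triples.
triples = [
    ("Prabhas", "type", "actor"),
    ("Rana Daggubati", "type", "actor"),
    ("Bahubali", "type", "film"),
    ("Prabhas", "acted_in", "Bahubali"),
    ("Rana Daggubati", "acted_in", "Bahubali"),
]

def query(relation, obj):
    """'Select entity E where E --relation--> obj', the structured form that a
    text query like 'Bahubali lead roles' must be translated into."""
    return [s for s, r, o in triples if r == relation and o == obj]

print(query("acted_in", "Bahubali"))  # ['Prabhas', 'Rana Daggubati']
```

The hard part, as discussed next, is producing that structured form reliably from free text.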
Let me give you an example. Suppose my query is "fastest ODI double century". It's a very simple query if you think about it, but if I were to execute it on a knowledge graph, I am assuming that there is a relation between cricketers and their double centuries, and then I have to do a sort on it. For a search engine to do this at large scale is difficult. There are many tail queries like "fastest ODI double century" where you would not be able to do the right translation, and that's where the problems of low coverage or lack of robustness come from. The second problem, which was also mentioned yesterday, is that building knowledge graphs is a very hard problem. Very talented people have been working on building open-source knowledge graphs for many years, and intuitively it is not easy to cram all the knowledge in the world into one knowledge graph, which is why knowledge graphs will always have some lack of coverage: there may be missing entities, nodes or relations, and that's why, when you do question answering on top, you get coverage issues. So our goal here is to create a dedicated entity search engine which will do better than this: it will have good coverage, it will be able to answer queries like the ones I mentioned, and it will have good robustness and accuracy.

That was the background; I hope the problem statement is clear to everyone. Let's do a deep dive now into the technical details. We said that knowledge graph based question answering works, yet it has problems such as coverage and robustness. How are we planning to do better? One obvious solution that may have come to your mind is to go and append more information to the knowledge graph: extraction is hard, relation extraction is hard, so maybe we keep the relation as phrases. Sure, there are many such approaches where you keep increasing the coverage of the knowledge graph, but at the same time the accuracy of the facts inside the knowledge graph goes down, and there are various methods which sit at different points on this spectrum. At the end of the spectrum, we thought, why not go all the way: we will use all the information that is out there in conjunction with the knowledge graph. And what is that source going to be? It is going to be the corpus of all the web documents. That's the main idea behind this talk: we are going to use the web documents in conjunction with the knowledge graph.

How do we connect the web documents to the KG? To give you an example: at the top I have the knowledge graph, as I mentioned before, about films: Bahubali, actor, Prabhas and Rana Daggubati. Below it I have some web documents; for example, one document says "Bahubali leads, Prabhas as Shivudu, Rana Daggubati as Bhallaladeva" and so on. If you notice, this document contains mentions of the same entities which are in the graph: Bahubali is mentioned here, Prabhas is mentioned here, Rana Daggubati is mentioned here. We are going to connect the mentions of these entities with the actual nodes in the graph, resulting in what is known as entity-annotated text. How do we create these links? Using entity annotators; more about that later. So at this point we have a query coming in from the left, "Bahubali lead roles", we want an answer in terms of entities, and in between we have these connected information sources, the knowledge graph and the web documents. That is our entity search engine. So how do we use these two information sources effectively? The first part I will talk about is how to match the query with the structured knowledge graph.
Very simply, the query is a collection of words. If I want to answer that query in terms of entities, the hints to what entities I am looking for are present in the query in terms of entities, types or relations; my job is to extract those hints and map them to nodes on the knowledge graph. How do we do this? We first tag entities in the input query text. Given the query "Bahubali lead roles", I use an entity tagger to identify an entity, in this case Bahubali, and pin it on the knowledge graph. In the second step, I look at all paths on the graph starting from this entity; here is one example path, there may be another path here, another there, and so on. All these paths have entities, types and relationships, and they are the current candidate translations of the input query. My next job is to take all these candidate translations and match them against the input text query, and the way I match them is by identifying the type matches and the relationship matches. Let's see that one by one. As I said, we first identify the entities mentioned in the input query. Here is one open-source tool we use, called TAGME. TAGME is trained using Wikipedia: since Wikipedia already has words marked with links to their source entities, it uses that as training data and trains a machine learning model; if you give it any input text, it will identify entities for you. Here is a snapshot of how we use TAGME on the input text "Bahubali cast, Prabhas" and so on: it identifies that the word Prabhas is actually a link to the entity Prabhas. So using TAGME we first identify the entities. The next question is how to match the input query text to the types and relationships on the knowledge graph. For that, we built our own convolutional neural networks. If you are not familiar with them, do not worry; just think of this as multi-class, multi-label classification. On the left-hand side the query comes in as words; on the right-hand side the classifier outputs a score for each type or each relation, so if one of the candidate types is "actor", it gets a score here. What happens in between is that each query word is first represented as a vector, then convolutional and pooling layers give a fixed-size representation of the input query; these are feature representations of the query. We also appended manual query-and-type word-overlap features: we found that not just outsourcing everything to the CNN, but also feeding in our own simple tf-idf-style features, actually works better, so we did that. At this layer we have a feature representation of the query, and then there is a classification layer sitting on top. So at this point we have entities marked in the query text, and we are able to identify the match between the types and relations mentioned in the query and those on the graph. This was about the query-to-knowledge-graph match. Just to make sure you are all with me: I am only presenting a way of computing features between a query and its candidate translations; how the actual machine-learned model behaves, I haven't talked about yet. That is yet to come, so just hold that thought.
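To make the classifier described above a little more concrete, here is a minimal sketch in PyTorch, assuming hypothetical vocabulary sizes and label sets (the talk does not give the exact architecture; the manual tf-idf-style overlap features are appended to the pooled representation before the classification layer):

```python
import torch
import torch.nn as nn

class QueryTypeRelationCNN(nn.Module):
    """Multi-label classifier: query words -> scores for each candidate KG type/relation."""
    def __init__(self, vocab_size, emb_dim=100, n_filters=64, n_labels=500, n_manual=10):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)           # word vectors (e.g. seeded with GloVe)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(n_filters + n_manual, n_labels)   # classification layer on top

    def forward(self, word_ids, manual_feats):
        # word_ids: (batch, seq_len); manual_feats: (batch, n_manual) tf-idf style overlap features
        x = self.emb(word_ids).transpose(1, 2)                  # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))
        x, _ = torch.max(x, dim=2)                              # max-pooling -> fixed-size query representation
        x = torch.cat([x, manual_feats], dim=1)                 # append the manual overlap features
        return torch.sigmoid(self.out(x))                       # one score per candidate type/relation

# Toy usage with made-up sizes:
model = QueryTypeRelationCNN(vocab_size=10000)
scores = model(torch.randint(0, 10000, (2, 5)), torch.rand(2, 10))
print(scores.shape)  # (2, 500)
```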
Alright. So given a query, we are able to match it with the knowledge graph. At the same time, we said we would also use the web documents for question answering, so how is that going to happen? Let's look at that next. As I said, we have entity-annotated web text: this web text has mentions of entities tagged with the original entity, so the word Prabhas in the text is actually tagged with the node called Prabhas in my knowledge graph. The way I do question answering on this entity-annotated web text is that I first identify text snippets in which query words are mentioned. If my query is "Bahubali lead roles" or "Bahubali lead cast", I identify all the snippets in which one or more of these query words appear; here is one example. At the same time, I am also looking for all the entities mentioned in these snippets. One snippet where it all comes together is "Bahubali cast ...": these are query words, and Prabhas is a tagged entity, so this is a potential candidate snippet. My next task: since there will be thousands or tens of thousands of such snippets, I have to score each of them. Some of them may be false positives, where the words just came together by chance and do not indicate answers. Here we again use a CNN to identify which snippets are correct, or rather to assign each snippet a score for giving me an answer. Internally, the features it uses are the distance between the query words and the entity word, and the tf-idf scores of those words. The CNN we ended up using was taken from the research work of Severyn et al. in 2015: here is the text snippet, here is the query text (so "Bahubali lead roles" would come here), and internally it again uses convolutional and pooling layers to produce a fixed-size feature representation, appends some additional features to form the total feature vector, followed by its own classification layer. So this CNN gives us the match between one particular snippet from the web documents and the query text. There are going to be tens of thousands of such snippets, each possibly mentioning the same entity or a different entity, and we sum up the scores over all of them to get one final score for a candidate answer such as Prabhas.
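As a rough illustration of that last step — scoring each retrieved snippet and summing the scores per tagged candidate entity — here is a minimal sketch, with `score_snippet` standing in for the Severyn-style CNN (a hypothetical function, not the actual model):

```python
from collections import defaultdict

def rank_candidates(query, snippets, score_snippet):
    """snippets: list of (snippet_text, tagged_entity) pairs from entity-annotated web text.
    score_snippet(query, text) -> float, e.g. a CNN match score.
    Returns candidate entities ranked by the sum of their snippet scores."""
    totals = defaultdict(float)
    for text, entity in snippets:
        totals[entity] += score_snippet(query, text)   # sum scores over all snippets mentioning the entity
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with a dummy scorer that just counts shared words:
dummy = lambda q, t: len(set(q.split()) & set(t.lower().split()))
print(rank_candidates("bahubali lead roles",
                      [("bahubali cast includes ...", "Prabhas"),
                       ("bahubali lead roles were played by ...", "Prabhas"),
                       ("solder is a lead alloy ...", "Solder")],
                      dummy))
```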
To put it all together: at this stage I have the query coming in from the left; for each query we found some translations on the knowledge graph, maybe type = actor is one, maybe entity = Bahubali with relation = acted-in is another; and in parallel there are text snippets coming in from the web documents, each connected to one or more candidate answer entities. My job is to look at all this data and come up with the final answer. Now, if you recall, at this point we are able to compute the match between the query text and the types, entities and relations, as well as the web text, so we have these feature vectors sitting on the edges: a type match score, an entity match score, a relation match score. How do we use these to produce the final ranking? That is the question. If I had access to a lot of wealth, I could actually hire labellers, experts, who would take a query and identify its best translation; I would do this for tens of thousands or millions of queries and build a machine learning model on top. That, of course, doesn't work: I do not have access to all that, and secondly, it's very difficult to come up with the right translation of a query. Sometimes you would think that actor is the right type, but maybe the type actor is not connected to my answer entity; maybe the type person is connected to it. How is the labeller going to know that? There are all these questions, which is why we decided not to create this kind of fine-grained labelled data. Instead we went with weak supervision. By weak supervision I mean that I am not looking for the right query translation, only the right answers for the query. Given the query "Bahubali lead role", the labellers are going to tell me that Prabhas and Rana Daggubati are some of the correct answers; there may be more, but I am not looking for a very fine-grained translation such as entity = Bahubali, type = ...; that is hard. And how do we do question answering using this supervision? We use something called a latent variable discriminative model. Again, do not worry about the math; what this slide is telling us is: given a query and a feature vector (my entity and type scores are sitting here), just choose the best translation at that point. As the model is getting trained, at every stage you have all the candidate translations laid out, you assign a score to each, you choose the max-scoring query translation, and you make sure you find a weight vector such that the right answer, the positive answer entity, scores higher than the negative answer. To put it in pictures: find me a weight vector, which is my classifier, such that the right answer, Prabhas, scores higher than one of the wrong answers, maybe Prabhakar, with the help of the best-scoring interpretation at that point. This process goes on iteratively, but since this is a max-greater-than-max kind of constraint, the problem is non-convex; it terminates at a local minimum, and that's the best we can do. This is the total objective function, the mathematical translation of what I just described.
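The talk only describes the objective verbally, but one plausible way to write the constraint down (a hedged sketch, not necessarily the exact formulation used) is a max-margin objective over latent translations $z$: for a query $q$ with candidate translations $Z(q)$, edge features $\phi$, weight vector $w$, a correct answer $e^{+}$ and an incorrect answer $e^{-}$, we want

$$\max_{z \in Z(q)} w^\top \phi(q, z, e^{+}) \;\ge\; \max_{z \in Z(q)} w^\top \phi(q, z, e^{-}) + 1,$$

which can be relaxed into a hinge-loss objective such as

$$\min_{w} \;\; \lambda \lVert w \rVert^2 \;+\; \sum_{q} \sum_{e^{-}} \max\!\Big(0,\; 1 - \max_{z} w^\top \phi(q, z, e^{+}) + \max_{z} w^\top \phi(q, z, e^{-})\Big).$$

Because both terms involve a max over the latent translation $z$, the problem is non-convex, hence the local-minimum caveat above.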
So that was about how we are going to solve the problem. Before we move on to the experimental section, are there any questions? Right, so the question is: we did question answering assuming that Prabhas already exists as an entity in my catalog; is that the right approach, and what if Prabhas, or some other entity, doesn't exist in the catalog? How do we do question answering then? In this work we are assuming that at least the entities are correctly present in the knowledge graph; they may be noisily marked in the web text, but we make that assumption. This work will help you if the relations or the types are not marked well. To answer the second part, what to do when the entities themselves are not marked: this work doesn't handle that. If you attended Prabhas's talk today, he was talking about never-ending learning; what they do is rely on the surface forms of entities, types and relations, so you get approximate answers in terms of surface forms. You do not have to know that Prabhas is an entity, only that this is a word which potentially indicates an entity, and there are techniques which will find those words and phrases for you; you may want to look up Oren Etzioni's work, or just look up NELL. Maybe we will take one or two more questions and then move on to the experiments. Okay: I had a question about the embeddings you are using; do you generate them yourself, or are you using some database of embeddings? The question is whether I am using a database of embeddings or generating them: we actually used GloVe vectors to seed our embeddings, but they were also tuned on our training data, so they were warm-started with GloVe. One last question, then maybe we will take more questions at the end. Are knowledge graphs specific to one particular domain, or do you build knowledge graphs for each domain? Correct; I gave examples of knowledge graphs which are proprietary: LinkedIn, for example, is going to be a professional knowledge graph, or maybe you are building a product catalog, which is a domain-specific KG for e-commerce. What we are working with are open-source, open-domain knowledge graphs, which try to capture information from all domains rather than being specific to one. Alright, I am going to move on to the experimental section; let's finish this and then we can take more questions. I gave examples of a few knowledge graphs earlier; we are going to use one of them, Freebase. It has 29 million entities, around 14,000 types and around 4,600 relations. We used ClueWeb09 as the web page corpus; this is a corpus released by CMU with 50 million pages. Google took it and annotated it with entities, releasing what is known as the FACC1 corpus; there are on average 13 entity annotations per page. For measuring ranking performance we used mean average precision; sometimes we sliced the ranking, took the top K and converted it to a set, in which case we used F1 as the performance measure. The code and the project in general are available at my advisor's website; you can take a look at the link. We used four different query sets, some in keyword format and some in natural language, about 8,000 queries in total. The first question we wanted to answer was: sure, we went through all this trouble to add an additional source of knowledge, the web text, but are we actually improving performance? So we ran a set of experiments where we used only the web text for question answering, only the knowledge graph, and finally both. Thankfully, as we surmised, we do better when we use both knowledge sources. Here the blue bar indicates experiments using only the knowledge graph, the green bar only the web corpus, and the red bar both; under different conditions, different experiments and different models, we always found better performance when we used both knowledge sources. An example of why this works better: if the query is "Bahubali lead roles", suppose my model erroneously chose the type "lead alloys", and also chose web text where an entity "solder" of type lead alloy is mentioned along with one of the query words, "roles". However, there are not going to be too many snippets in which this happens, and an important query word, Bahubali, doesn't occur anywhere near this snippet, so overall, when you run the model with these features, this kind of entity gets a very low score. On the other hand, when you have the right translation, where "roles" is mapped to the type actor, Prabhas is connected to the type actor and is also mentioned in web text along with the other query words, and there are a lot of such snippets; this happens quite often, which is why Prabhas gets a much, much higher score than, say, solder.
We also compared with related work. There are many research groups working on similar problems; these are some examples of recent works where only the knowledge graph was used for question answering. Much of this work is about KG-based question answering and also depends on the syntax of the question, the natural-language grammar, to come up with the right translations. Those approaches work well for queries with nice formulations, where you can depend on the grammar, but they fail on the keyword queries we usually see in web search. It turns out that on three of the four query sets our model did best, and on the fourth it still did better than most of the approaches. Some examples of wins and losses: as I mentioned before, some queries are easier to answer with web text; the fastest-ODI queries are examples, and so are "presidents sworn in on an airplane" or "who was the first US president ever to resign". For these kinds of queries, sometimes the answers are not present in the knowledge graph, or landing on the right translation of the query, the one which would give us the answer on the knowledge graph, is very difficult, which is why web text let us do better. On the other hand, we found that sometimes web text actually led us to wrong answers: for "creator of The Daily Show", it turns out there are many more snippets supporting the wrong answers in the web documents than the right one. This is one of the cases where we felt we have to strike the right balance between the knowledge graph and the web text; using it blindly doesn't always work. Now we come to the final two slides. The insight I take from all these experiments is, first, that question answering is a hard business: we have made some progress, but there are still many complex, multi-step questions which current models cannot answer. To give some examples: companies that make fanless motherboards, or cities where Einstein taught. Here you have to go through multiple steps, such as x is a motherboard, x is fanless, and there is a company y which makes x; you have to go through all these steps to land on the right answer, and achieving this translation is difficult either on the graph or on the web text, so these questions remain future work. We also found through our experiments that web text and the knowledge graph make very nice complementary information sources: the knowledge graph is structured and accurate but has low coverage, while web text has high coverage (we are, after all, using all the data that is there) but introduces some noise, so striking the balance between the two is critical. One last slide: this was all about the open domain, but what if you have a closed-domain application? This still works for you. Suppose you are doing corporate or enterprise search: you have corporate hierarchies, which teams, which managers, which peers for people, maybe some keywords on what they are working on, and in parallel you have their wikis and internal documents, which act as plain text; you can use both together to do better enterprise search. Similarly, if you are doing product or e-commerce search, you have a nice product catalog, but you also have user reviews; again, you can combine the two to do better search. So that brings me to the end of the talk.
Thank you very much. These are some of the papers that have come out of this work; feel free to look at them if you are interested in more math or more details. We can take questions now. Thank you, Uma; can we have a round of applause please? Thank you to the people standing at the back; I'm really sorry you had to stand through the talk, but I'm really happy for you, Uma, that you had an amazing full house today. Okay, so we can take some questions now; also, if anybody is feeling reticent about asking, I'm happy to read questions out on your behalf, just use the hashtag #5thelephant. Okay, questions. Hi, Uma: does the order of the words matter in the current research, for example "lead role in Bahubali" instead of "Bahubali lead role"? Right, so to a large extent: I started this work using only keyword queries, and at that time our aim was that order should not matter at all, but in more recent years I came to the conclusion that keyword queries are not the only queries we want to handle; we also want to take into account the query grammar when it is present. In summary, the system should be able to make use of the grammar when it is present, but not depend on it, because we also have keyword queries; which is why we sort of outsource this work to the deep neural network, while making sure we provide nicely balanced query data from both keyword and natural-language queries. Hi, Uma: in using the combination of the two methods, is there a significant performance overhead for search queries? I guess not significant; yes, there is some overhead, but thankfully IR, that is document search, has come a long way; we can make use of index caching, maybe pre-compute some queries, and do tricks like that to make this practical. Hello, Uma, one question: when you talk about entities, I think ontology also plays a big role, so you would usually have a preferred term and associated terms to build that. One of the challenges when we try to adapt public tools for building a knowledge graph is missing terms along with the ontologies; do any of the public sources give you the agility to add more and more entities? Is there a faster way, because it's always a very time-consuming process? Sorry, is the question how publicly available graphs give you the freedom to add more entities? Yes: each of these public sources we use (Semaphore is one of them, by the way), but it's very time-consuming when you want to add more entities, so is there any quicker way you think this can be handled? I have seen that it mostly goes through crowd-sourcing plus some sort of supervision: you make your suggestion, but it goes through a review before making it into the knowledge graph. So yes, there is some flexibility, but it's not very straightforward. Uma, my question is: when we work with such knowledge graphs, facts change over time, say something that held in 2011 has been disproved by 2015, or queries like "who is the president of the US", where we need the time point when searching; how can such queries be handled? So, we are assuming that that kind of information is available in the knowledge graph.
To give you an example, YAGO, one of the knowledge graphs mentioned earlier, has actually come up with these kinds of time-based facts; or, if it is mentioned in the web text, we will be able to make use of it. But we are not doing any specific checks in our model to make sure temporal facts are covered; we are assuming the knowledge graph takes care of that. Just a second, how many people have questions? Can you please raise your hands? One, two, three; okay, three questions, we can do that. Right, the one about the weak supervision part first. Am I audible? So, can you elaborate a little on the weak supervision part? I've come across this for the first time, so I want to know what it is, maybe a couple of sentences, and what was the relevance, what boost did we get out of it? Right, I'll give a quick answer because this is a long question and we can take it later. By weak supervision I meant that you do not have very fine-grained supervision of the form: given this query, this is the right translation, this is the type, this is the entity, this is the relation. That kind of translation is very hard to come up with. What we could do is: given this query, this is the answer. So by weak supervision I meant that we are missing the intermediate step; we go directly from query to answer, which is why in our model we do not take the intermediate query translation as given; we keep it as unknown and take a max over it. More details, I think, we should discuss later. Hi, thanks. Hi, so a two-part question. One is: you used web text to augment the knowledge graph's answering capabilities; how would you compare that with an approach where you use web text to augment the knowledge graph itself and then ask questions over it? The reason I ask is that sometimes having a huge corpus of documents, particularly in domain-specific cases, is hard; the web is kind of a happy case where you have lots of documents. That's one part, and the related part is: how do we make the best use of both, that is, what types of queries are best answered with the knowledge graph versus with documents? We'll take the second part offline. For the first part: yes, there are approaches which do this; you have the knowledge graph, you take some text or some description and append it to the knowledge graph. This is not like having the whole web text with you while doing question answering, just some high-quality snippets. The advantage is that you are only doing question answering on the knowledge graph, so it's going to be faster, and you have a little more flexibility than before, because maybe you added some semantics or some new facts which you can now answer; but you lose the extra flexibility you get when you have the whole web text with you. If you think about this in terms of head and tail queries, there is a long tail of queries where you lose flexibility, while you do better in the measured area. Do more people have questions? We have one question here, but I'm afraid we are running out of time; do you mind taking it offline, is that okay? Cool. So Uma is also going to be participating in the BOF session; maybe not, but there is a BOF session on women in data science at 2 o'clock. BOF is birds of a feather; for people who don't know, it's a peer discussion group. The next talk is by Amrit; Amrit is right here. Hello, hi, good afternoon, am I audible?
Good afternoon, everyone. I am Amrit Sarkar, working as a search engineer at Lucidworks, and for the next half an hour or so we will see how we can build analytics applications with streaming expressions in Apache Solr. A bit about Lucidworks: we are an enterprise search company based out of San Francisco, with offices all over the world including Bangalore. We have a product, Lucidworks Fusion, built on top of Apache Solr, which drives search engines, and we also provide consulting and support to organizations using Solr as their search or analytics technology. We will begin with the challenges one faces while building such applications on near-real-time data; we will introduce streaming expressions with a brief overview; we will categorize expressions into sources, decorators and evaluators, and look at some examples. This talk is heavily use-case based: we will discuss some real-life use cases, from simple to complex, and understand their performance complexity. We will then introduce statistical programming, some fairly new statistical functions being added to Solr, before listing the references. For building an analytics application on offline data, a number of tools and technologies are already available: you can preprocess your offline data to create certain views, which then become the input for your dashboards. The challenge arises when you receive constant updates and need to refresh your analytics application at regular intervals. Executing complex correlations and functions on unstructured, non-preprocessed data is extremely time-consuming; also, to bring together an entire analytics application you depend on multiple tools, a database, a preprocessing tool, a data visualizer for the dashboards, which leads to higher maintenance cost. Before getting to what streaming expressions are, a brief overview of Apache Solr. Apache Solr is an open-source search engine built on top of the Apache Lucene library; it is highly scalable, flexible, and provides rich search capabilities on text. There is a distributed mode in Solr called SolrCloud, where a number of Solr nodes can reside on different servers, managed by Apache ZooKeeper. Other features like spell checking, auto-completion and highlighting are available in Solr itself; in this talk we will discuss streaming and aggregations, and there are advanced features like Learning to Rank, which leverages machine learning algorithms to improve your search experience. The features on the right-hand side comprise the parallel computing framework of Apache Solr, which is only available in SolrCloud mode; in this talk we will discuss the streaming API, expressions, shuffling and worker collections, while Parallel SQL is a wrapper on top of streaming expressions: whatever SQL query you provide to Solr gets converted into a streaming expression internally and executed implicitly. Starting with the streaming API: it is a Java API for parallel computation of MapReduce and relational-algebra operations. You create streaming objects through a StreamFactory, and a number of APIs are available to perform different operations; the open function emits the search results as a stream of tuples. TupleStream is the base class of streaming in the Apache Solr source code, and all the related streaming objects and APIs are available in the package org.apache.solr.client.solrj.io.
To extend the capabilities of streaming to non-Java folks building their applications in Ruby, Python, Perl or any other language or script, streaming expressions were introduced. These are a query language and a serialized format for the streaming API: an expression can be compiled to a TupleStream object, and a TupleStream object can be serialized back into a streaming expression. They can be sent directly over HTTP to the /stream request handler, which is implicitly defined in Solr itself, or executed via SolrJ. Let's look at a common example: here we are performing a full index search on a Solr collection and retrieving only some specific fields. If you look at the bottom, we are executing this query against the stream API, and the expression itself is a search expression. It takes as input gettingstarted, which is the collection name; we have specified the ZooKeeper on which this collection is hosted, localhost:9983; in the q parameter we have hatchback, meaning we only want documents which have the keyword hatchback in any of their fields; we are limiting the result set to the year 2014; we are retrieving id and model name; and we want the final result set in ascending order of id. Looking at the result, the top right corner shows a representation of what is happening: we are fetching data from the collection gettingstarted and performing a search on top of it. The result set is emitted as JSON, with ids 1, 2, 3 in ascending order along with their model names. The last tuple of a streaming expression is EOF true, along with the response time, which is the execution time of the expression on the Solr servers themselves; in this example it is 12 milliseconds. These expressions can be divided into categories depending on their usability. We have stream sources, which are the origin of tuple streams: search and facet streams fetch data from a SolrCloud collection, jdbc pulls data from a relational database, and there are stats, topic, timeseries and train streams, which can work with machine learning models and extract certain features. These sources are then wrapped by stream decorators, which perform aggregations, operations and functions on top of them; these operations work row-wise. Suppose you want to merge two result sets, perform an inner join, select only the top 5, or keep only unique tuples, unique rows: stream decorators are there for that. If you want to calculate or add new field values based on existing field values on each row, stream evaluators come into action; they operate column-wise. You can execute straightforward mathematical calculations like division, multiplication and summation, and can also perform conditional statements like if-then-else.
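As a minimal sketch of what the search expression from the example above might look like when sent to the /stream handler (the collection, field and ZooKeeper details here are assumptions based on the talk, not the exact slide contents):

```python
import requests

# A streaming expression: full-index search on the gettingstarted collection,
# restricted to hatchbacks from 2014, returning id and model, sorted by id.
expr = """
search(gettingstarted,
       zkHost="localhost:9983",
       q="hatchback AND year:2014",
       fl="id,model",
       sort="id asc")
"""

# Streaming expressions are POSTed to the implicit /stream handler of a collection.
resp = requests.post("http://localhost:8983/solr/gettingstarted/stream",
                     data={"expr": expr})
print(resp.json())  # {"result-set": {"docs": [..., {"EOF": true, "RESPONSE_TIME": ...}]}}
```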
So, as I said at the beginning of the talk, we will discuss some real-life use cases; all the use cases discussed here have been adopted by our clients at Lucidworks in one way or another. First, look at the dataset we have: suppose we have a collection with data about certain flights, their source city and their destination city, and we want to determine the destinations reachable within a single stop from New York, that is, with one stoppage or layover. There are obviously a number of possible ways of achieving this, but in this example we will try to visualize the data in a graphical format. The inner nodes streaming expression on the distances collection puts New York as the root node of the graph and gathers all the immediate destination cities from New York; the destinations we gather here then become the input sources for the outer nodes streaming expression, so that we get all the destination cities one level away from New York. Looking at the graphical representation and the result set: New York is at level 0 because it is the root node, and we have another node, Bengaluru, at level 2 as desired, whose ancestors are Paris and New Delhi, so you can fly from New York to New Delhi to Bengaluru, or from New York to Paris to Bengaluru. Then there is a very popular use case where you have a big amount of data in your collection, or in Cassandra, or in any storage system, and you want to retrieve some relevant keywords from it to get a brief summary of what the data represents. The significantTerms expression in Solr leverages Apache Lucene's rich text search capabilities and helps us retrieve those terms. I am using the Enron emails dataset here, a very popular training set for data science problems, and we are restricting the result set with q to the emails sent to Tim Belden only. There are some extra parameters: the minimum number of documents a term should appear in is 10, and no more than 20% of the documents should contain the term, so you can rule out the very common English words used in communication, helping verbs, articles, prepositions and so on; we also define that the minimum length of the term should be 5. Looking at the result, we get this information in descending order of score, by relevance: the terms and entities used most often when anyone was sending an email to Tim Belden.
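A rough sketch of what these two single-expression use cases might look like, with collection and field names assumed for illustration (the talk does not show the exact slides):

```python
import requests

SOLR = "http://localhost:8983/solr"

# Graph walk: start at New York in a hypothetical "distances" collection (src -> dest edges),
# gather immediate destinations, then walk one more level to get single-stop destinations.
one_stop = """
nodes(distances,
      nodes(distances, walk="New York->src", gather="dest"),
      walk="node->src",
      gather="dest",
      scatter="branches,leaves",
      trackTraversal="true")
"""

# Significant terms: summarize emails sent to Tim Belden in a hypothetical "enron" collection.
summary_terms = """
significantTerms(enron,
                 q="to:belden",
                 field="body",
                 minDocFreq="10",
                 maxDocFreq="0.2",
                 minTermLength="5",
                 limit="20")
"""

for coll, expr in [("distances", one_stop), ("enron", summary_terms)]:
    docs = requests.post(f"{SOLR}/{coll}/stream", data={"expr": expr}).json()
    print(docs["result-set"]["docs"][:3])
```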
Now, those were very straightforward use cases, where we just leveraged a single-statement expression and got our results; let's try a bit more complex one. We have two datasets here. On the right-hand side we have data for certain organizations which have adopted certain campaigns, with their impressions, clicks and conversions: campaign 01 for organization 01 received 134 impressions, 48 clicks and 4 conversions, and we have this information week by week. On the left-hand side we have the currency cost an organization incurred while adopting these campaigns: campaign 01 cost 6,600 units. We want to calculate some useful real-time metrics on top of this: the conversion ratio, the CTR or click-through rate, and a cost ratio of currency cost to conversions. First we need to bring the two datasets together in one place so we can do those calculations: we have to join the cost data with the aggregated conversions, clicks and impressions per campaign, and for the simplicity of this talk we restrict the data to organization 01. First we execute a search stream expression (we already saw an example of this), restricting the result set to organization 01. We then wrap this search expression with a rollup: rollup performs a straightforward group-by, rolling up over a field, which we specify as campaign ID, and along with rolling up we calculate some extra variables like the sums of conversions, impressions and clicks to get the aggregated data. Once we have this aggregated data, we wrap the rollup expression with a select, which simply renames these variables to whatever names we want; in this example sum(conversions) becomes AGGRCONV, and similarly for impressions and clicks. Finally, we fetch data from the currency cost collection with another search expression and inner-join its result set with the top-level select on the field campaign ID. This looks like a big query, right? Look at the representation to understand what's happening: on the top right corner, we fetch data from weekly data, perform a search on it, roll up over the field campaign ID, and rename the variables with a select; in parallel we fetch data from the currency cost collection, perform a search on it, and the two result sets are inner-joined. In the result set, for campaign 01 the aggregated conversions are 41, the currency cost 6,600, the clicks 259, and we have these numbers for campaigns 02 and 03 as well. Now that we have the entire data in one place, in one single frame, we just need to do some mathematical calculations: we wrap the entire query from the last slide with a select and perform divisions. To get our conversion ratio we divide aggregated conversions by clicks; to get CTR, the click-through rate, we divide aggregated clicks by impressions; similarly, for the cost ratio we divide currency cost by aggregated conversions. If you read the graphical representation from left to right, you can see how we have built this query up from the beginning. Looking at the numbers (I hope they are visible, otherwise I'll summarize): for campaign 03 the CTR is 0.36, the best of the three, while the other two ratios, the conversion ratio and the cost ratio, are best for campaign 01. So we are done with one phase of streaming expressions, discussing the use cases we can implement; let's move on to how complex they are when they get executed. Whatever we discussed in the last slide, we calculated some metrics; now we want to store these metrics in a separate collection as a report, which could be monthly, biannual, annual or weekly. There is an update expression: we wrap the entire expression we just created with it and specify the collection we want to index into, in this case a collection named report. I have specified the batch size as 500; we have only three rows here, for campaigns 01, 02 and 03, but the batch size of 500 signifies that if you have a result set with millions and millions of docs, it will index those documents in batches of 500. Looking at the representation from left to right, we are now up to the update expression, and in the result set we are indexing three documents, as calculated. And we have something called a worker here: all the use cases discussed until now are executed by a single node, a single worker only; there is no parallelism being introduced so far.
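Putting the pieces from the last few slides together, the single-worker pipeline might look roughly like the following (collection names, field names and the exact metric definitions are assumptions reconstructed from the talk):

```python
import requests

# Aggregate weekly campaign data, join it with campaign costs, derive metrics,
# and index the resulting tuples into a "report" collection.
expr = """
update(report, batchSize=500,
  select(
    innerJoin(
      select(
        rollup(
          search(weekly_data, q="org_id:org_01", fl="campaign_id,conversions,clicks,impressions",
                 sort="campaign_id asc"),
          over="campaign_id",
          sum(conversions), sum(clicks), sum(impressions)),
        campaign_id, sum(conversions) as aggr_conv, sum(clicks) as aggr_clicks,
        sum(impressions) as aggr_impr),
      search(currency_cost, q="*:*", fl="campaign_id,cost", sort="campaign_id asc"),
      on="campaign_id"),
    campaign_id,
    div(aggr_conv, aggr_clicks) as conversion_ratio,
    div(aggr_clicks, aggr_impr) as ctr,
    div(cost, aggr_conv) as cost_ratio))
"""

resp = requests.post("http://localhost:8983/solr/weekly_data/stream", data={"expr": expr})
print(resp.json())
```

Note that innerJoin expects both of its inputs to be sorted on the join key, which is why both searches sort by campaign_id.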
Assuming there are n rows involved in executing this expression, our performance complexity is O(n). Now, there is a concept called shuffling in streaming expressions. As I just said, in the use cases discussed so far only a single node was used while executing the expression: the request is sent from a client to a stream handler, which forwards it to the respective workers, and we can define more than one worker for a streaming expression execution. I hope everyone understands what a Solr collection is: it represents the entire dataset. The collection can be divided into logical slices called shards, and each shard can have multiple copies called replicas; in this case we have five shards, one to five, with two replicas each, so a total of ten cores representing the Solr collection. The workers which receive the streaming expression query then fetch data randomly from these cores: worker one may fetch from shard two, replica two; worker three from shard one, replica one; worker five from shard three, replica two; and correspondingly for the other workers. Every time you execute the same query, they may fetch data from a different shard, from a different core. Now, we can have a use case where we want to send a correlated subset to a single worker only. Suppose your dataset has a field called category: you want to send all documents with category science to one worker, say worker one, all documents with category mathematics to another, worker two, and so on for the other workers, so that you have correlated data in one place, and when you perform those mathematical calculations the numbers come out right. This particular concept is called controlled shuffling, where we can specify a parameter saying send a particular subset to a particular worker, such that each worker will request each shard in the collection and only retrieve those particular documents. Now, the workers we discussed are part of a worker collection. These worker collections are just regular Solr collections in a SolrCloud cluster; they can be part of the same cluster or of an entirely independent cluster, and the goal here is to separate data processing from data fetching: you can have a primary collection hosting your primary data, and another worker collection residing somewhere else, which fetches data from your primary collection and performs the processing, so that you have multiple servers doing different things. So, as I said, worker collections perform streaming aggregations and receive shuffled streams from the replicas; they can be empty, created just in time, or regular Solr collections hosting data of their own. I hope everyone is still with me on the use case we discussed two slides ago: we calculated some metrics and indexed them into a separate collection; now we want to implement the same use case, this time in parallel, using n workers.
Now, this is the big query we have built so far: the query up to the update expression, which is now wrapped by a parallel expression whose first parameter is the worker collection name, in this case worker itself. We have defined the number of workers we want to leverage, which is three; this worker collection is hosted on the same ZooKeeper as the primary collections, and I want the final result set sorted in ascending order of campaign. I mentioned controlled shuffling: for calculating the numbers in this use case, we want to send all the documents of campaign 01 to one worker, campaign 02 to another worker, and campaign 03 to any other available worker. There is a parameter called partitionKeys which you can define on the stream sources; in this case we have two stream sources, one search fetching data from weekly data and another fetching data from currency cost, and in both we specify partitionKeys as the field we want to partition the data on, which is campaign ID. Let's look at the result. On the left-hand side we have the representation, where we can see three parallel executions of that update expression being accumulated, or aggregated, at a parallel node; on the right-hand side we have the result set: one shard replica of the worker collection is responsible for indexing one document, another shard replica for the second, and a third for the last document. Now let's work out the complexity of this expression: obviously some extra work needs to be done to aggregate the respective results from each worker, but assuming there are in total n rows involved in executing the expression, the complexity becomes O(n/w), where w is the number of workers. Since the aggregation term is almost negligible in comparison, we can safely ignore it and state that the execution time of the streaming expression improves by the number of workers we are using.
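A trimmed-down sketch of the parallel wrapper and partitionKeys around just the aggregation step (the full report pipeline from the earlier sketch would be wrapped the same way; the worker collection name and fields are assumptions):

```python
import requests

# Run the rollup on 3 workers of a "worker" collection; partitionKeys on the search
# source ensures all rows of a given campaign are shuffled to the same worker.
expr = """
parallel(worker, workers="3", zkHost="localhost:9983", sort="campaign_id asc",
  rollup(
    search(weekly_data, q="org_id:org_01", fl="campaign_id,conversions,clicks",
           sort="campaign_id asc", partitionKeys="campaign_id"),
    over="campaign_id", sum(conversions), sum(clicks)))
"""

print(requests.post("http://localhost:8983/solr/worker/stream", data={"expr": expr}).json())
```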
Now we move on to statistical programming in Solr: these are some fairly new statistical functions being added to streaming expressions, so Solr's rich text search capabilities can now be combined with in-depth statistical analysis. You can do covariances, correlations, Euclidean distances and k-nearest-neighbour graphs, and represent your data in multiple formats through statistical programming, backed by the Apache Commons Math library. The statistical programming syntax is used to create arrays from lists of tuples, so that you can transform, manipulate or analyse them. Let's discuss a use case of how we can actually use this. Here is some real stock market data from February 2013 to January 2017, with the names abstracted: the closing price of stock A on 1 February 2013 was 30, stock B on the same date was 168, and correspondingly we have this information for stock C; we have this historical data for four years, and we want to determine the correlation among the stocks. If, over a given time frame, the stock price of company A goes up whenever the stock price of company B goes up, they are positively correlated; if the opposite happens, that is, the stock price of company B goes down, they are negatively correlated. So let's determine the correlation between stocks A and B first. There is a let expression, which allows us to set variables within an expression itself and outputs a single tuple. Then we implement two different search expressions: the first fetches data from the historical stocks collection, restricts the result set to stock A and assigns it to a variable of the same name, stock A; the second limits the data to stock B and assigns it to the variable stock B. These variables then become input to a statistical function called col: it takes all the closing points from the stock A result set, puts them into a single array and assigns it to the variable prices A; similarly, it pulls all the closing points from stock B into another variable, prices B. Finally, prices A and prices B become input variables for another statistical function called corr, which implements the Pearson product-moment correlation. The Pearson product-moment correlation effectively plots an X-Y graph of the two variables and analyses how parallel they are to each other: the more parallel they are, the closer the value is to one; otherwise it tends towards minus one. Now let's see the result. The stock A to stock B correlation is 0.999. Suppose a new CEO is going to be appointed for company A, or a new brand ambassador announced, and you anticipate a rise in the stock price of company A: you can make a fairly confident prediction that the stock price of company B will also go up. The correlation between A and C is minus 0.18, which is weakly negative; the trend of the stock prices of company A has little to do with the trend of the stock prices of company C.
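A minimal sketch of that correlation expression, assuming a hypothetical stocks collection with ticker, date and close fields:

```python
import requests

# Pull closing prices for two tickers into arrays and compute their Pearson correlation.
expr = """
let(a=search(stocks, q="ticker:A", fl="date,close", sort="date asc", qt="/select", rows="2000"),
    b=search(stocks, q="ticker:B", fl="date,close", sort="date asc", qt="/select", rows="2000"),
    prices_a=col(a, close),
    prices_b=col(b, close),
    tuple(corr_ab=corr(prices_a, prices_b)))
"""

print(requests.post("http://localhost:8983/solr/stocks/stream", data={"expr": expr}).json())
# -> a single tuple, roughly {"corr_ab": 0.99...}
```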
So the final takeaway from this talk: streaming expressions in Apache Solr allow us to perform complex correlations, MapReduce-style operations and statistical functions, which can be executed in parallel using n workers on dynamic subsets, leveraging Apache Solr's rich text search capabilities; these subsets can be fetched from various data sources, Solr collections, databases and so on, and the entire expression can be executed in near real time, making near-real-time analytics applications possible. These are the references and knowledge-base articles listed for this talk; all the use cases and examples are uploaded on GitHub, and check out the official documentation of streaming expressions and statistical programming. This talk is heavily influenced by the blog of Joel Bernstein, the creator of streaming expressions in Apache Solr, and I have also listed some presentation links from the last four years relevant to this topic. That's it from me; thank you so much for being here, and I think I have a decent amount of time left to answer some questions. Thank you, Amrit. Questions? One, anybody else? Two; yeah, you can start. Hi Amrit, it was a fine talk. I want to ask: suppose we are working in a real-time situation where data is coming from some monitoring pipeline through Kafka, and we are just taking the information and putting it on a dashboard; how does Apache Solr work there? Suppose for some part of the day we want a streaming expression over a particular interval of data; how does Apache Solr fit in this scenario? So you have the option of working with time-series data: you can restrict your time frame if you have a field called time of a date field type. You specify an entire time frame, from this time to this time, with a start and an end, and because of Solr's rich search capabilities it will fetch the documents from that time frame fairly quickly, and then these expressions work on top of them. Hello. You mentioned that this can fetch data in real time; let's say I have to join two datasets, and if the datasets are pretty big it is still going to take time. For example, if I run a Hive query it will still take time because the datasets are big; so what's the difference between running a query on Hive and running a query on Solr? Yeah, so first of all, in the challenges part I said that you want a dynamic subset: you want to fetch a dynamic subset from one big dataset and another dynamic subset from a second dataset. First, Solr's search capabilities help us retrieve those documents quickly; second, the performance of streaming expression execution scales linearly with the number of workers you are using. I won't go into the details of the physical cores and servers, but for 25 million documents you can execute these joins and perform this analytics in less than a second; those are the benchmarks we have done. Now, as for how Hive compares to Solr: obviously Hive has its own advantages and Solr has its own, but here you have the liberty to expand your Solr cluster to as many nodes as possible, which will linearly improve your performance. Amrit, I have a question; it's kind of an obvious one, because when you see Solr, the other tool that comes to mind is Elasticsearch. I worked with Elasticsearch two or three years back, and it has changed quite a bit since then; the two are very comparable, and when I did a Google search people didn't have a strong opinion on which one to choose. So what do you think: all these use cases you described are, I think, possible in Elasticsearch as well, so which one should we pick? Any performance differences or feature differences? Right, so yes, we can probably do all of that in Elasticsearch, though I will point out that there are 70-plus statistical functions already available in Solr, and I'm not confident they are all available in Elasticsearch as of now. Also, both are open-source search engines, but how big a community has been built around a particular project is also very helpful for solving certain problems. By way of comparison, Elasticsearch is, yes, more often seen as an analytics tool, and Solr more as a search tool for building search engines, while with the inclusion of streaming expressions, effective analytics applications are now also possible in Solr on top of the search it already provides. There is a BOF session on Solr itself, and this is a very broad topic you brought up, because we would have to compare feature by feature what's possible and what's not. Okay, just one quick question: there is a tool such as Kibana, right, so is there something equivalent to Kibana for Solr? Yes, it's called Banana, and it's a fork of Kibana itself.
Also, since I work for Lucidworks, we build such dashboards too: there is an App Studio built on top of Solr and our Fusion product, and it's kind of a competitor to Kibana or Banana as well. Yeah, so on the term near real time: in the earlier example you gave, of how a campaign is performing based on clickstream data, let's say I want the numbers up to the last five minutes. The first question is: do we need to run that update expression every five minutes? So yes: if you want to retrieve the information up to the last five minutes, you first specify that you only want documents from the last five minutes, then you execute the expression and get your numbers; you have to rerun your query every five minutes. Exactly; and the second question is on the same example: we are looking at click-to-conversion up to the last five minutes from the beginning, and you are writing into some target collection; when that collection is being updated, do I still have access to the old data versus the new data? Because it's a simultaneous operation, some rows might be updated and some might be holding stale data; how do we handle that? So there is a concept called time-series collections being introduced in recent versions, I'm not sure whether you are familiar with it; time-series collections can host data in regular intervals, so you don't really overwrite the data. Obviously you need to create your own unique IDs; I won't go into the Solr details of how you avoid overwriting your existing collection, but you need to take care of that while forming these queries. Let's quickly go back there: in this case we have only the campaign ID, so you can introduce a unique ID so that documents don't get overwritten in your final collection. When I was building this talk we had introduced very complex queries, but then feedback came that we should simplify them so that everyone can understand; when you are building these queries, obviously, in the division part here, the aggregated clicks can be zero, and something divided by zero is undefined, so you need to make sure those checks are in place, and that you have a unique ID; otherwise, check out time-series collections in Solr, which should suffice for your use case. We have one question from the front. I just have one question: for unstructured data like production logs, if you want to run analytics, Solr is fine, but you pointed out a few other use cases, like the stock market data, which are kind of structured; for that, Apache already has another project, Druid, which is really good for real-time analytics, where you can roll up, build dashboards and do time series, so why prefer Solr over Druid? So, I'll be really honest, this is an alternative, right; it's an alternative solution, and I'm talking about the advantages of each and where Solr applies. Someone mentioned Elasticsearch earlier; we would have to really get into the intrinsic features of each to compare. Building an analytics application here is fine, but again, for that Apache project you mentioned, the question is whether it can retrieve effective dynamic subsets from a large dataset; it can fetch data, but for structured data, yes, you can look into that, you can look into Solr, do performance benchmarking, and then do the comparison.
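Relating to the divide-by-zero point above, one way such a guard might look inside the metrics select (a sketch using the if/eq evaluators with the same assumed field names as earlier):

```python
# Guarded conversion-ratio fragment: emit 0 instead of dividing by zero clicks.
guarded_metric = "if(eq(aggr_clicks, 0), 0, div(aggr_conv, aggr_clicks)) as conversion_ratio"
```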
But if you have unstructured data, then you have Solr. So yes, answering your question, this is an alternative, not something that will rule over all other projects; that would be my best answer. For unstructured data, yes, I totally agree, Solr will dominate, but for structured data you have other options too. If you have any more questions, Amrit is going to be doing a BOF at 2:45. Great. There's an announcement that Uber, Salesforce and Walmart are giving away prizes, so in case you want to try your luck, you can answer a quiz after lunch. After lunch we have a talk from Swiggy about how they handle demand, so see you on the other side of the lunch break. Thanks. Hey everyone, I hope all of you had lunch; did everyone have lunch? Yes, awesome. I know there were long queues there, but I hope everybody got lunch. Next we have a speaker from Swiggy who's going to talk about how they handle demand; I'm sure a lot of us use Swiggy and are really curious about how they do this. One announcement: there's a BOF session at 2:45 (for those of you who don't know, that's a birds-of-a-feather discussion session), it won't be recorded, it's going to be a lot of fun to attend, and it's going to be upstairs. So whenever you're ready, it's yours; can we have a round of applause, also to wake us up from lunch? Thank you. I hope I'm audible to everyone; okay, great. So I'm a data scientist at Swiggy. Those of you who got a chance to attend the earlier session by Nathan will probably have an understanding of the way recommendation systems work at Swiggy and how relevance and ranking are done; what I'm going to talk about is the other side of things, the delivery system: how orders get delivered and what issues are faced. The topic is serviceability under high demand, which refers to the requirement to maintain reasonable serviceability even when customer demand is high. I'm sure most of you are familiar with how Swiggy works: it's basically an online, on-demand platform where the customer orders food from a restaurant, after which a delivery agent is assigned to the order; the agent goes to the restaurant, waits for the order to be prepared if it isn't already, picks it up, and then delivers it to the customer. That's the order cycle. Now imagine what the ideal scenario for an on-demand delivery company like Swiggy would be. To make things a little more definite, imagine there were infinite resources: both an unlimited on-demand delivery fleet and complete information about the future; let's not discount that, complete information about the future. Then we would be able to service any customer ordering from any restaurant at any time of the day, in any situation, rain or shine, thunderstorms, it doesn't matter, under unpredictable patterns like traffic, and promise a reasonable delivery time and ensure the order gets delivered within that promised time. That is the ideal we would hope for. Unfortunately, we have to deal with real-world constraints: we have a finite delivery fleet; we have unpredictable scenarios, like the bad weather I hinted at, or a competitor's server being down, which leads to a spike in customer demand on our platform; and even in normal times, leaving out bad weather, customer demand is
So are restaurant preparation times. Part of that is because preparation times aren't carefully instrumented, and also because restaurants don't provide clear information about, say, how much time they need to prepare a particular type of dish. In addition, you have variation coming from the delivery agents themselves, for the same restaurant-customer pair: a delivery agent who is more familiar with the neighbourhood can take shortcuts and get to the destination earlier, and this happens more often than you might think. Even if it's not really shortcuts, if you're just more familiar with the neighbourhood, then from your own experience of getting from point A to point B you'd know which lane to be in, and so on, so it makes a difference; there's variability coming from there as well. And then the fleet itself is only partially on-demand: if we know a week in advance that we need more capacity, then yes, we can probably get there or very close to it, but if I need a sudden surge in the delivery fleet in the next half hour, the chances of actually reaching the target are quite slim. So it's partially on-demand.

Now, what are some ways in which we can address these real-world constraints? I'm just going to go through them; they're going to sound a bit disconnected, and I'll expand on them as we go along, and hopefully by the end of the talk you'll see how these things tie together. Here I'll just describe them very briefly. The first step is that we're talking about situations of stress, situations of high demand, and we need to quantify that, preferably with a single metric: quantify the load on the delivery system. Then we can represent, at any given time, the undelivered orders: the orders that have entered the system, that we've accepted and are obligated to fulfil but haven't delivered yet. Treat the undelivered orders as a queue; as in any other queuing problem, you have an inflow of new orders coming in and an outflow of orders getting delivered, and you use queuing-model abstractions on top of that. Then, as you can imagine, you need predictive models everywhere: predictive models for the orders that are going to come in over, say, the next ten minutes; for the preparation time of the restaurant; for the time it takes the delivery agent to pick up the order from the restaurant and actually deliver it to the customer. We also need real-time strategies to reduce demand; that's the whole point: you have serviceability, you have demand, and you want to reduce the demand. But it's not just about reducing demand; it's also about which orders I reduce, how I allocate my demand, how I shape my demand. Not all orders are created equal: I might want to prioritize certain orders based on which restaurant it is, or the location of the customer, or the location of the restaurant. So it's also about intelligent demand allocation.
Now, so that I can intuitively formulate the problem statement, let's consider a somewhat hypothetical scenario. Let's say that orders could get arbitrarily delayed, that we give ourselves permission for orders to get arbitrarily delayed. Then in principle all incoming orders can be accepted, and they can be delivered in due course. The simple reason is that even if you consider a peak time like dinner, say between 7 p.m. and 9:30 p.m., there's going to be a surge in demand, but eventually you go past the peak, the demand comes down, and the orders that have been backed up will eventually get delivered, right through the night. So you can certainly, in principle, deliver all the orders if you accept the condition that you don't care how delayed they get. Naturally, the problem is that this creates a terrible customer experience, because a lot of the orders are going to get delayed immensely. When I structure it in these terms, the question naturally becomes: I want to take just as many orders as I can serve, in real time. I need to make real-time decisions about the number of orders to take in the next interval such that they get serviced reasonably, and, like I said, also be smart about which specific orders to take.

To give you a sense of the difficulty involved: like I said, we need to limit the orders, we need to somehow throttle them at some level, so that the ones we accept we can deliver within a reasonable time. And for the orders that we do accept, the question is what strategy to use so that we can deliver them within the promised times. So, one: limit orders so that the ones you take, you deliver within the promised times. Two: once the order actually comes in, what strategy do you adopt so that that becomes a reality? On a temporal axis, one precedes two: you first go through the filtering stage and then you get to two. But the solution to one depends on the solution to two, so if you were to think purely at an order level there is, in a certain sense, a circular dependency and you can't actually solve it exactly. That's only a theoretical concern; in reality you use approximations and assumptions. Specifically, you think in terms of averages and fluctuations about the averages; you talk about error margins and how error margins affect decision making.

So let's look at the thing I mentioned: the quantification of the load. What is the metric that captures the load on the delivery system? One straightforward metric is what you see there: it's just the ratio of the number of undelivered orders divided by the number of delivery agents. I'll call it the stress ratio. It's intuitively appealing, because if you keep the number of delivery agents constant and increase the number of undelivered orders, you naturally expect the system to be under greater stress. Likewise, if the undelivered orders are kept constant and you increase the number of delivery agents, then the orders are better spread out among the resources, so you're under less stress. So it has an immediate intuitive appeal.
One very important thing is that everything I have talked about and everything I am going to talk about deals with scenarios where s is greater than 1. We certainly do have situations where s is less than 1, but for all practical purposes that's a trivial situation: when s is less than 1, in most cases the serviceability is pretty good, we're able to deliver within the promised times, and yes, even there we can optimize and improve, but the gains are much smaller. The real problem occurs when s is greater than 1, and most of the time when you're ordering at, say, weekend dinner or even weekend lunch, s is significantly greater than 1. Everything else I'm going to talk about concerns that scenario; s less than 1 you can think of as the trivial case.

So now we have a stress factor. Let's also say that we somehow arrive at an s_max, some threshold on the stress factor beyond which serviceability drops below acceptable levels. Let's set aside the question of how we arrive at that s_max; let's just say we have some s_max for a given area and we want to make sure s doesn't breach it. Let's see how we could do that.

I want to take a quick detour to the queuing-model abstraction. It's really simple, but we'll keep referring back to it. Like I said, the undelivered orders are the queue. Consider a short interval of time delta t, and let delta a be the number of orders coming in and delta b the number of orders getting delivered within that interval. The change in the number of undelivered orders is nothing but the inflow minus the outflow, delta a minus delta b; just simple arithmetic.

The problem is that we want the stress factor to remain under s_max, so you can think of a situation where s is approaching s_max; that's when you want to throttle the orders. But we need to be careful about what we throttle and when. Imagine a customer who has spent a lot of time on the Swiggy home page, browsing through the restaurants, finally decides on a restaurant, consults with four or five of his friends, puts together a bunch of items and adds them to the cart, and when he's about to make the payment something flashes on the screen saying the order cannot be processed because all the delivery agents are busy. That's a terrible customer experience. So the added problem is that you want to throttle orders, but you don't want to throttle them when users are just about to place the order; at least, you want to minimize that if you can't completely eliminate it. So you want to apply the intervention, whatever your throttling intervention is, upstream, before you get to that state; preferably you apply it at listing. It's much better that the customer didn't see the restaurant in the first place than to see the restaurant, add items to the cart, and then discover that the order can't be processed.

So now let's dive a little deeper. Imagine you're at 90% of your s_max, your threshold stress; you don't want to cross it, so you want to take some real-time action.
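To make the queuing abstraction and the stress ratio concrete, here is a minimal sketch of the bookkeeping being described; the variable names, the s_max value and the numbers are illustrative assumptions, not Swiggy's actual system.

```python
# Minimal sketch of the stress-ratio bookkeeping described above.
# All names and numbers are illustrative; this is not Swiggy's actual system.

def stress_ratio(undelivered_orders: int, delivery_agents: int) -> float:
    """s = undelivered orders / delivery agents."""
    return undelivered_orders / delivery_agents

def step(undelivered_orders: int, inflow: int, outflow: int) -> int:
    """Queue update over a short interval: delta_n = delta_a - delta_b."""
    return undelivered_orders + inflow - outflow

S_MAX = 2.5          # hypothetical serviceability threshold for an area
agents = 120         # delivery agents currently on the ground
undelivered = 260    # orders accepted but not yet delivered

# Over one interval: 60 new orders come in, 45 get delivered.
undelivered = step(undelivered, inflow=60, outflow=45)
s = stress_ratio(undelivered, agents)

if s >= 0.9 * S_MAX:
    print(f"s = {s:.2f}: approaching s_max, start throttling upstream (at listing)")
```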
The question then is how many orders we should throttle, and for how long. In general this is quite a difficult question to answer, but here are a few facts that help us arrive at an approximate solution. Fact one: the delivery fleet is more or less constant, and the delivery rate is also almost constant. This can be shown analytically, but you can also see empirically that it's true; you can think of it as a quasi-static assumption, that over this period things are not changing so dramatically that the delivery rate changes. Fact two: if the order acceptance rate (the inflow) is equal to the delivery rate (the outflow), then your stress level doesn't change. This goes back to the queuing picture: the change in undelivered orders is inflow minus outflow, so if inflow equals outflow the undelivered orders don't change, and under the quasi-static assumption the delivery fleet doesn't change either, so your stress factor stays the same. Fact three: if it's 8 p.m. and you can predict what's going to happen between 8:00 and 8:10 for a particular region, specifically the number of unconstrained orders, then you know how many orders you can take, because the delivery rate is fixed: on average it's a deterministic function of the fleet size. So I know how many orders I should accept in the next ten minutes, and if I can predict the number of unconstrained orders (unconstrained meaning I didn't intervene, I didn't apply any strategy), then I know what fraction of the orders to accept.

Which brings us naturally to demand prediction: predict the number of incoming orders over a rolling horizon of, say, ten minutes. As you can imagine, the features we would use include the order rate in recent times, but also the order rate at a similar time the previous week. If you look at the order rate, it's going to rise, reach its peak and then decrease around the dinner or lunch peak; without even looking at the data you can intuitively imagine that's true, more or less. So if you want to know what's going to happen between 8:00 and 8:10, you want to know whether that's a point where the rate is still increasing on average or whether you've already crossed the peak and are coming down. If you clearly have an established downward trend at that point, you may not need the historical data so much, but if you want to know where the point of inflection is, you certainly need the historical data. So you also use data from a similar time, in the same zone, in previous weeks, maybe several weeks put together. In addition, you want to know the activity on the app or the browser or whatever platform people are using to order: how many people are on listing, because these are the people who will actually place orders. (Well, the screen seems to have gone blank; okay, while this is getting fixed...) So you take that as a feature as well, in order to predict the demand.
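Here is a minimal sketch of fact three: given a roughly fixed per-agent delivery rate and a demand forecast for the next ten minutes, compute what fraction of incoming orders to accept. The forecast is just a stub (a weighted blend standing in for a trained model), and all numbers and names are hypothetical.

```python
# Sketch of the acceptance-fraction logic described above (fact three).
# The demand forecast is stubbed out; in practice it would be a model trained
# on recent order rates, same-time-last-week rates, and app/listing activity.

def predict_unconstrained_orders(recent_rate: float, last_week_rate: float,
                                 listing_activity: float, horizon_min: float) -> float:
    # Hypothetical placeholder: a weighted blend instead of a trained model.
    blended_rate = 0.6 * recent_rate + 0.3 * last_week_rate + 0.1 * listing_activity
    return blended_rate * horizon_min

def acceptance_fraction(delivery_rate_per_min: float, horizon_min: float,
                        predicted_orders: float) -> float:
    """Accept only as many orders as the (quasi-static) delivery rate can clear."""
    deliverable = delivery_rate_per_min * horizon_min
    return min(1.0, deliverable / predicted_orders) if predicted_orders > 0 else 1.0

predicted = predict_unconstrained_orders(recent_rate=40, last_week_rate=36,
                                         listing_activity=50, horizon_min=10)
frac = acceptance_fraction(delivery_rate_per_min=30, horizon_min=10,
                           predicted_orders=predicted)
print(f"predicted demand: {predicted:.0f} orders, accept roughly {frac:.0%} of them")
```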
Now, one crucial thing is this time t. I told you we want to predict for the next interval of, say, ten minutes, and this is a crucial choice: there's a trade-off involved. On the one hand, I want t to be large enough that the raw numbers are higher. It's a simple point: if I take a longer interval, there are more orders, so my error as a fraction of the number of orders is going to be smaller, because statistical fluctuations for larger numbers are relatively smaller. For instance, if you think of this as a Poisson process, then if n is the number of orders, the standard deviation is root n, so the relative error, root n over n, is one over root n, which decreases as n gets larger. So you want t large so that your fractional error is smaller. But at the same time, we want to work under the quasi-static assumption, so you don't want t so large that the situation on the ground has completely changed: specifically, that the number of delivery agents has changed dramatically, or that you suddenly have thunderstorms to deal with. So there's a trade-off in t between how accurate you want the predictions to be and how much of the quasi-static assumption you want to retain, and in general it depends on the properties of the area.
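A quick worked version of the Poisson argument, with a made-up order rate, just to show how the fractional error shrinks as the window grows:

```python
# Worked version of the fluctuation argument above, with a made-up order rate.
# For a Poisson count with mean n, the standard deviation is sqrt(n),
# so the relative error is sqrt(n)/n = 1/sqrt(n).
import math

order_rate_per_min = 40  # hypothetical average order rate for an area

for window_min in (2, 10, 30):
    n = order_rate_per_min * window_min
    relative_error = 1 / math.sqrt(n)
    print(f"{window_min:>2}-minute window: ~{n} orders, "
          f"relative fluctuation ~{relative_error:.1%}")
# Longer windows give smaller fractional error, but stretch the quasi-static
# assumption (fleet size and weather may no longer be roughly constant).
```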
So far, what have we done? We've said that orders come in, we have a stress factor and a threshold for it, we don't want the stress factor to cross that threshold, so we throttle orders by predicting what's going to happen in the next ten minutes and then limiting things at listing, which we'll discuss in a bit more detail. But now the question is: which orders do we actually limit and which do we allow? Should we treat them all on an equal footing?

A standard approach in these kinds of situations is customer segmentation: you divide your customer base into segments based on, say, revenue, or loyalty (premium customers versus not-so-premium customers), or how frequently people order. It doesn't matter exactly what you do; you have some set of segments, you reserve some slots based on the calculations done earlier, you allocate those to the premium customers first, and when they fill up you move on to the next category, so that you prioritize the people higher up in the segmentation. A problem here is that, like I said, we're dealing with errors in demand prediction. If I implement something like this, I have to predict not just the total demand but also the composition, the probability of each segment within the upcoming demand, and because each segment's count is smaller than the total, the fractional errors in each are going to be greater. That's something to keep in mind.

An alternative approach is to make the decision purely on the location of customers, specifically the restaurant-customer distance. Let me rephrase, since that's a little opaque. Imagine you have a restaurant, and for every restaurant there is a radius within which any customer can actually see the restaurant on the app and order from it. What I'm talking about is shrinking that radius, so you have two concentric circles around the restaurant, and the customers lying between those two concentric circles can no longer place an order. Naturally, depending on the distribution of customers across the radius, you're going to throttle the demand, because fewer customers can actually go ahead and place the order. But something else happens here as well. Go back to the queuing abstraction, delta n equals delta a minus delta b: clearly delta a is being reduced, because fewer people can order. But there's something else from the restaurant's point of view: the average restaurant-to-customer travel time has reduced, because you've shrunk the radius. That means the average cycle time, the time between ordering and delivering, has reduced, which means your delivery efficiency, your delivery rate, your throughput, increases. So this action not only reduces delta a, it also increases delta b; it works at both ends: it throttles orders and at the same time increases your delivery rate. And when you apply it to an entire area, this effect is very much measurable; it's not just a theoretical nicety, it's something we can actually determine.
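A minimal sketch of the radius-shrinking idea: customers between the shrunk radius and the original radius stop seeing the restaurant at listing. The haversine helper and all coordinates and radii are illustrative assumptions.

```python
# Sketch of radius shrinking: customers between the shrunk radius and the
# original radius no longer see the restaurant at listing. Coordinates and
# radii are made up for illustration.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

restaurant = (12.9352, 77.6245)        # hypothetical restaurant location
normal_radius_km = 6.0
shrunk_radius_km = 3.5                 # applied while s approaches s_max

customers = {
    "c1": (12.9400, 77.6300),          # ~1 km away: still shown
    "c2": (12.9650, 77.6550),          # ~5 km away: hidden while the radius is shrunk
}

for cid, (lat, lon) in customers.items():
    d = haversine_km(*restaurant, lat, lon)
    visible = d <= shrunk_radius_km
    print(f"{cid}: {d:.1f} km -> {'shown' if visible else 'hidden'} at listing")
# Besides cutting inflow (delta_a), the surviving orders are closer on average,
# so cycle time drops and the delivery rate (delta_b) goes up.
```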
So we formulated the problem as: take as many orders as you can actually serve. That's how we've got to this point. But then, of course, we also need to understand the loss in orders: we've applied some strategy, and we need to estimate how much potential order loss there was relative to the counterfactual scenario where we didn't impose any strategy at all. This problem is equivalent to estimating the unconstrained order flow over the entire interval: say you applied the strategy for two hours and measured a certain order rate; you want to know what the unconstrained order rate would have been had you done nothing. In general, estimating that is quite difficult. And don't confuse this unconstrained flow with what I talked about for the ten-minute interval: that was the unconstrained demand prior to applying the strategy for the next ten minutes; here I'm talking about the unconstrained order flow over the full interval for which the strategy was applied. This is rather difficult to estimate.

The problem is that even something like A/B testing is difficult, because there are just too many variables. Let me give you a better sense of why. What would you typically do with A/B testing in this case? Ideally you'd pick another similar area, or a similar day, a similar hour or interval of time, not apply any strategy there, look at what the order flow looks like, then apply the strategy on a different day and compare. Leave aside the fact that on the day you don't apply the strategy a lot of customers are going to get upset; let's look at it purely from the standpoint of making a comparison. For A/B testing you want all the variables to be more or less similar; that's the standard understanding. But here your variables aren't just a finite set of scalars, not just averages: you're talking about a shape, a curve, a trajectory for the order rate, a trajectory for the number of delivery agents. For two situations to be comparable, the trajectories have to be very similar, and that may not depend only on some average absolute difference; it might also depend on the peaks, the sudden spikes. So establishing similarity of regions over a significant interval of time, when the differences we're measuring are of the order of 5% or 10%, is hard. We're not seeing differences of, say, 50%, where we'd be fine with fairly different conditions because even if things vary a little the claim is still statistically significant; we're talking about effects that vary by maybe 5%, 10%, perhaps 20%.

So far I've talked about the demand side of things: the order flows, throttling them, selecting them, and so on. There is also supply-side control: the restaurant parameters. I hinted at this earlier as well: you want to limit orders, but in a specific way, as a function of the restaurant. Leave the restaurant-customer distance aside for the moment; there are other things, for instance the preparation times.
This was also mentioned in the earlier talk: when you have restaurants with high preparation times, restaurants with a lot of orders backed up, you want to shape demand away from them. You can deprioritize those restaurants, or not take orders for them and not show them at listing, or temporarily shut them down for half an hour or so, so that you shape demand towards restaurants with shorter preparation times. Again, here you need to understand and quantify the trade-off between the potential order loss and the increase in the delivery efficiency of the system, and the second part is very important: when you do this reallocation, it not only ensures better service, it again reduces your cycle time. Shorter preparation times mean the order gets completed quickly, the cycle time is shorter, the delivery rate increases, and you end up clearing your queue faster.

Then of course there is the heart of the thing. At the centre of everything I've talked about lies the assignment algorithm: what procedure is used, when an order comes in, to assign it to a delivery agent, and what exactly are we optimizing? Do I minimize the average order-to-delivery time, or the 90th percentile of the order-to-delivery time, or the fraction of orders that exceed the promised delivery time by ten minutes, or do I take a completely different approach? One common theme in this talk is that formulating the problem itself is part of the challenge; it requires significant thinking and effort. At the very least, the inputs we need are, again, the preparation times (you want the delivery agent to be there when the order is prepared, but not much earlier, otherwise the agent is just waiting), the restaurant-to-customer travel time, and the time taken to travel to the restaurant itself, because once a delivery agent delivers an order they need to get to the restaurant for the next pickup. Then of course there is batching: you batch a couple of orders, two orders for the same restaurant with nearby customers, group them for the same delivery agent, who picks up both, delivers one, and then delivers the second. You can imagine, even without getting into the math, how much more difficult that becomes: you now have two orders, both have to be ready when the agent picks them up, the first has to be delivered within its promised time, and the travel to the second customer has to again meet that requirement, and in some algorithms you might want to make the assignment while the delivery agent is still completing the earlier trip. So you can appreciate the complexity of the problem. And then there's error propagation: understanding how much errors in preparation time affect delays and reduce delivery efficiency.
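As a toy illustration of the "work backwards from the promise" idea described here (not Swiggy's actual assignment algorithm): given a predicted preparation time and travel times, pick the agent whose arrival at the restaurant leaves the smallest idle wait while still meeting the promise. All times and agent names are made up.

```python
# Toy illustration of backward scheduling for a single order; this is not
# Swiggy's actual assignment algorithm. All times are in minutes from "now"
# and all numbers are made up.

order_promise_min = 40           # promised delivery time
prep_time_min = 18               # predicted restaurant preparation time
restaurant_to_customer_min = 15  # predicted travel time to the customer

# Predicted time for each candidate agent to reach the restaurant.
agents_travel_to_restaurant = {"agent_a": 10, "agent_b": 20, "agent_c": 27}

best = None
for agent, to_restaurant in agents_travel_to_restaurant.items():
    pickup = max(to_restaurant, prep_time_min)   # can't pick up before food is ready
    delivery = pickup + restaurant_to_customer_min
    wait = pickup - to_restaurant                 # time the agent idles at the restaurant
    if delivery <= order_promise_min:
        # Feasible: prefer the agent with the smallest idle wait.
        if best is None or wait < best[1]:
            best = (agent, wait, delivery)

if best:
    agent, wait, delivery = best
    print(f"assign {agent}: waits {wait} min at restaurant, delivers at t+{delivery} min")
else:
    print("no agent can meet the promise; the order should have been throttled upstream")
```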
So there's a whole host of issues, and I've addressed some of them, which is why some of these may seem like a repeat. Like I said, there's a lot of variability, a lot of stochasticity in the variables, which limits predictability, and that affects pretty much every aspect of the delivery problem. Then there is coming up with an appropriate metric or set of quantities that lets us evaluate two comparable scenarios and do some sort of A/B testing. Another thing is that not only do we not know the globally optimal solution, or a methodology for finding it, we also don't know how much our solution deviates from it: if I'm talking about the assignment algorithm and I get some delivery efficiency with the algorithm I have, and there is some unknown, oracle-given, globally optimal solution, I not only don't know that solution, I don't even know how far I am from it. Then there's the challenge of running carefully controlled experiments: you don't want to shut off restaurants too much, you don't want to keep changing too many parameters, so there are issues around not upsetting the business too much while still being able to run these experiments. And then, like I said, understanding the effects of unexpected delays. In general it's a system with many interacting components where the effects propagate, so it's hard to disentangle exactly which one leads to what. That brings us to the end of the talk; thank you for your time. Questions, please; raise your hands if you have questions.

Let's start over there. You mentioned inflows and outflows; do you consider order value as well while forming the throttling strategy? Right, so, like I said, if you're doing a customer segmentation then you need to do something like that, yes; in that case you need to predict the fraction, the composition, of the various customer segments that come in.

At the back over there, you can start. It was a great session, really nice to know all the problems, the intuition you had about them and the way you presented it. One of the things I'd like to know is the different model forms you used and what loss functions you assumed. For example, in some of the problems, did you take into account what is optimal for the customer experience, or did you also take into account revenue, maximizing margin and things like that? From a modelling point of view and also from a business point of view, I just want some perspective.

Yeah, so let's break this down slightly. The first part: when it comes to revenue, the only way it enters the delivery calculation is when you're trying to throttle the orders and decide whom to show, which customers get access to viewing that restaurant or that region; that's the only place so far where we've used actual revenue information. The other thing is more significant: what is it that you want to optimize? Is it customer satisfaction? And what is customer satisfaction? If I promised you delivery in 35 minutes and it took 50, that is clearly a case of customer dissatisfaction.
But if I had promised you 45 minutes instead of showing you 35, then maybe it's not so bad. So the question is: should I change the time I actually promise, so that I can attain delivery within that time, or should I minimize the overall average time, or the 90th percentile of it? Typically the focus is on ensuring the order gets delivered within the promised time, or maybe within ten minutes of it; that tends to be the typical focus, all else being equal, though things are really not always equal.

Hello, yeah: could you please explain the assignment algorithm in a little more detail? Sure. I obviously can't get into all the details, but the idea is that you want to make the assignment such that when an order comes in, you have information about the promised time, information (a prediction, really) about the restaurant's preparation time, and a prediction of how much time it takes to go from the restaurant to the customer; all of these are predictions. Think about it this way: one simple version is that you want the order to reach the customer in, say, 40 minutes, so you work backwards and figure out when the food will be prepared and when the dish needs to leave so that it can reach the customer in 40 minutes, because you have the restaurant-to-customer travel time. You try to ensure the delivery agent gets there in time, but you don't want the agent to get there too early, because then he has to wait, so you make the assignment such that this gap is at a minimum. That's one thing. Now imagine you have more than one order simultaneously and you're batching them: the last leg is essentially restaurant to the first customer, then to the second customer, and both have to happen within compliance, so you have to decide which of the two orderings is better, you have to ensure that by the time the agent picks up, both orders are actually ready, and at the same time you don't want to keep them waiting. It's about combining all of these.

Hi, so you mentioned the delivery radius, the serviceability radius: how is that defined? Because I think it might keep changing as demand goes up. So typically there are some fixed parameters associated with different areas for these things; ideally you keep them fixed and don't tweak them. But everything I'm talking about deals with scenarios where you need to apply some real-time strategy: you're suddenly facing a surge in demand, it's reaching a point where, based on prior knowledge, you know your serviceability numbers are going to fall, so you need to throttle, and then you shrink the radius. I don't know if that answers the question, but maybe let's move on.

Apart from using the size of the delivery fleet and the average time a delivery agent takes to deliver an order, do you also consider parameters like a very popular restaurant, say Truffles or Meghana Biryani, not being able to churn out that many orders in a short time? For very popular restaurants with very high demand from various channels, including Swiggy and other food-tech companies, does the serviceability factor also take into account their capacity to churn out that many orders?
Right, so that's the thing I talked about, the supply-side controls. We certainly look at, say, the near-real-time preparation time at the restaurant over the last 15 or 30 minutes or more, and we might use proxies like the order-to-pickup time. If it's too high, then we deprioritize that restaurant, both at listing and, if needed, by temporarily shutting it down. So yes, those things are very much part of it, especially during big events and so on. Any more questions? Okay, thanks; one round of applause for him, please.

We'll have the BoF sessions starting in a bit, supposed to be at 2:45: the women-in-data-science BoF is happening upstairs and the Solr BoF is happening here.

Welcome back, everybody. It's become a slightly slow evening, so let's start with the flash talks. Do we have Shyam Shinde from Helpshift here? All right, Dr. Amit Kureb then.

So, optimization analytics can be used in the e-commerce industry. I worked in the past at Amazon, Penske and a few other companies in the US. When you go to an e-commerce website, let's say Amazon or Flipkart, you see a set of products displayed; they are based on your previous history, where the system has identified the most appropriate products you'll buy, or products similar to them. When you place an order, the system has to immediately figure out which warehouse is going to supply the product. There could be warehouses throughout the country; in the US, Amazon had something like 90 or 100 warehouses. It figures out which warehouses are close by, but let's say the order has a tie, diapers and some other products: some of them are in one warehouse and the others are in another, but there's a warehouse which has all these items and is far away. Those are the decisions where optimizers come in: whether it's better to supply from a single warehouse which is far away but has all the items together, or from two warehouses which are nearby but each has only a subset of the items.

But even before these items ship from a warehouse, data science is used to figure out what items to keep in which warehouse: what combination of items should be kept in each warehouse and where to keep them within it, because a single warehouse could be a million square feet holding a million items, and every single day you're getting combinations of orders. For example, a cell phone and a Bluetooth headset might be ordered together, so it may be a good idea to keep them together. So based on the history of orders, that's how items are placed within a warehouse, and the quantity of each item to maintain in each warehouse is decided so that an order can be fulfilled from a single warehouse.

Once the warehouse is decided, the items need to go to the customer; some customers are on two-day delivery, some on one-day delivery, so the best route has to be followed for sending these items across. The delivery promise you see on Amazon or Flipkart, the date by which you'll get the item, comes out of this: the system works backwards from when the item needs to arrive to the latest the shipment can leave, and it optimizes the cost side as well, because if some trucks are not going out full, it can consolidate your shipment along with that truck. It does that kind of thing.
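A toy sketch of the single-far-warehouse versus split-shipment decision described above; the cost model (a fixed cost per shipment plus a per-kilometre cost) and all numbers are invented for illustration.

```python
# Toy comparison of fulfilling one order from a single far warehouse that has
# all items versus splitting it across two nearby warehouses. The cost model
# (fixed cost per shipment + per-km cost) and the numbers are invented.

FIXED_COST_PER_SHIPMENT = 50.0   # packing, handling, truck dispatch
COST_PER_KM = 0.8

def shipment_cost(distance_km: float) -> float:
    return FIXED_COST_PER_SHIPMENT + COST_PER_KM * distance_km

# Option A: one warehouse 400 km away holding every item in the order.
single_far = shipment_cost(400)

# Option B: two warehouses 60 km and 80 km away, each holding part of the order.
split_near = shipment_cost(60) + shipment_cost(80)

print(f"single far warehouse : {single_far:.0f}")
print(f"two nearby warehouses: {split_near:.0f}")
print("fulfil from", "one far warehouse" if single_far < split_near else "two nearby warehouses")
```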
Once the items reach a location near the customer, there's last-mile delivery: the process from that end location to the customer's doorstep, where a truck goes out. So this is what happens when you place an order, but even before you place an order there are strategic decisions the company has to make: where should they put the warehouses? These decisions are based on how many customers there are in different places; you look at the map of customers and study certain factors to figure out the best locations for warehouses and what their capacity should be. These are called strategic decisions, long-term decisions that cannot easily be changed. Then there are the operational decisions I mentioned, like which route you take when you're going to the customer. And there's a third type, tactical decisions, taken for three or six months: you plan your workforce, how many people to keep, what the plan for the next three months is, what the overall routes are going to be, whether you divide a city into 15 zones (you might change it to 16 later). So strategic, tactical and operational decisions are all made, and that is how you finally get your item at your doorstep. Thank you.

Is Shyam here? Shyam, you're up next. Shyam is talking about building an ML platform using Python.

Hello, my name is Shyam Shinde. I've come from Pune and I'm currently working at Helpshift. Helpshift provides a customer-service platform for agents to solve the tickets filed by end users; it's basically a kind of SaaS application that our company develops. Our input data is text data: the tickets filed by different customers. Our customers are companies like Microsoft and various gaming companies; one of our customers was Myntra as well. If you file a ticket, say "I did not get my order on time", the ticket goes to an agent and the agent replies, so it's basically a traditional CRM, and we are applying ML on top of it.

So what are the ML use cases in this domain? I'll explain two use cases we're trying to solve. In the first, a customer defines a set of categories for the issues they face, such as order issue, payment issue, or late delivery. Whenever an issue is filed, we automatically categorize it into those predefined categories and label it. The agents have defined rules saying that if a particular label is attached to an issue, route it to a particular agent, because some agents are specialized in handling particular types of problems. So that's the first use case: text classification.
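Here is a minimal sketch of that first use case, classifying tickets into predefined issue categories, using scikit-learn and a handful of made-up tickets; it is only meant to illustrate the shape of the problem, not Helpshift's pipeline.

```python
# Minimal sketch of ticket classification into predefined issue categories.
# Training data is a handful of made-up tickets; this is not Helpshift's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tickets = [
    "I did not get my order on time",
    "my delivery is two days late",
    "payment failed but money was deducted",
    "refund not credited to my card",
    "I lost my coins after the game update",
    "app crashes when I open the store",
]
labels = ["late_delivery", "late_delivery", "payment_issue",
          "payment_issue", "game_issue", "game_issue"]

# TF-IDF features + a linear classifier; the predicted label can then be used
# by routing rules to send the ticket to a specialised agent.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tickets, labels)

print(model.predict(["order arrived very late", "money deducted twice"]))
```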
The other use case is this: every company faces the same sets of issues over and over. Gaming companies, for example, see issues like players losing coins, the game not working, or an upgrade not working, and they get thousands of issues a day in those categories. Some people manually track the types of issues being faced, and if a new type of issue comes up they have to act on it: create a knowledge base around it and train the agents to solve those tickets. What we do is build topics over the historical issues, and if a new type of issue starts surfacing in the tickets we automatically detect it and suggest to the admins: hey, you're facing this kind of issue, please have a look at it. One example: one of our customers was hitting a problem roughly once a month and hadn't realized they faced that type of issue only once a month; they just saw a spike in issues on that day and the agents weren't able to handle it. Our ML use case was able to detect that, so the agents could be trained on it.

Now, when you have thousands of domains and customers, and all these different use cases to apply, building a machine-learning platform for this kind of situation is a hard problem. If we had only one domain, we could build ten or fifteen models, push them to three or four nodes, and they'd start predicting. But our problem is that we have a large number of domains, each with three or four use cases, and our production system currently has thousands of models running. So building a machine-learning platform to serve that number of domains and models was a challenging problem. How did we approach it? One option was to start from an existing framework like TensorFlow, but those frameworks are suited to particular use cases (TensorFlow is mainly for deep learning) and are not production-ready on their own for our setting. So we built our ML platform from scratch. We used the Python Celery framework, which is a distributed task-management framework. (Okay, I'll wrap up in two minutes.) We use Celery nodes: whenever our customers' admins define a model, we get a request for it, our Celery nodes build the model, and we push those models to S3 or another cloud store. The prediction nodes then start syncing those models, and in real time the prediction service picks them up and starts predicting for all the scenarios I explained earlier. Thank you, everyone.
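A rough sketch of that train-then-sync flow with Celery and S3; the broker URL, bucket name and the training step are placeholders, not Helpshift's actual platform code.

```python
# Rough sketch of the train-then-sync flow described above: a Celery worker
# trains a model on request and pushes it to S3; prediction nodes later pull
# it down. Broker URL, bucket name, and the training step are placeholders.
import pickle

import boto3
from celery import Celery

app = Celery("ml_platform", broker="redis://localhost:6379/0")
BUCKET = "ml-models-example"  # hypothetical bucket

@app.task
def build_model(domain_id: str, tickets: list, labels: list) -> str:
    """Train a model for one customer domain and upload it to S3."""
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(tickets, labels)

    key = f"models/{domain_id}/classifier.pkl"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=pickle.dumps(model))
    return key  # prediction nodes poll the bucket and sync new keys

def load_model(key: str):
    """What a prediction node would do when it syncs a freshly built model."""
    body = boto3.client("s3").get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return pickle.loads(body)
```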
Just a quick announcement: there's a BoF session starting in a bit, at 4:30, on data science in production; that's happening upstairs, and the other BoF session, on data science and math, is happening here in about fifteen minutes. So, is Amisha here?

Hello everyone, my name is Amisha and I work at Mapbox. I'm going to quickly talk about a newly developed data storage backend built at Mapbox recently, called Hecate. At Mapbox we deal with a lot of GeoJSON data; millions of JSON features have to be loaded and unloaded. So recently we developed this data storage backend. It's very performant, and it's very much inspired by the data storage backend of OpenStreetMap, which supports tons of edits from all around the world. It helps us load entirely new data sets containing millions of entities. What's interesting is that it's open source, so anyone can make use of it. It lets you do schema validation, so you can specify how the data should be shaped; it provides an API over a Postgres database; and it can help you query your geospatial data by bounding box, or consume data streams when you're ingesting from upstream, so whatever data is coming in, you can take it in as streams. It also provides secure authentication. We are using it heavily right now; for example, we're providing a map with a lot of places from Foursquare, again from all around the world. Another interesting thing is that it helps you visualize the data: you can quickly pan around the map and validate whether the data looks good. So yes, that's about it; I was just excited about this and wanted to share it with you. If you deal with a lot of GeoJSON data at work and are interested in knowing more about it or using it, I'll be happy to talk to you. Thank you.

I'd like to apologize for a small mistake I made: the math and data science BoF is starting right now, and Amit is going to lead the session. People who are at the back, can you please come to the front? This is supposed to be an interactive discussion, so please come forward, so it's easier for people to discuss amongst each other. Yeah.