Then we have startups that want to enable commerce via chat: the idea is to automate everything you do across different apps inside one app, through conversation. Then we have personal assistants that want to solve one problem end to end, a problem like automating the way you schedule your appointments. And then we have developer platforms where people are building really cool, high-tech tools that they want to sell to different businesses, to solve problems like automating customer-care systems. So sometimes we wonder whether the chatbots we see are really AI-powered, human-powered, or human-assisted.

So why is there so much interest in chatbots now? If we wind the clock back ten years, the Google search bar was all the rage among web users: you typed a simple text search query, got a bunch of results, and you were happy. And SMS was pretty much the mode in which we did messaging. Ten years on, we have tens of messaging applications on our phones, and billions of users use these applications every day. WeChat, for example, has done a great job of connecting consumers to businesses to do commerce. And if you have ever used Amazon customer care, you might have been surprised that your query can be resolved in no time if you just chat. So it is this notion of interactive search that is beginning to take over from typing a simple search query and waiting for a response.

So in which cases should we replace the search bar with interactive search? We see the search bar pretty much everywhere: mobile apps have one, and web apps have one too. On one side, I have shown a few applications that involve a few simple steps, like booking a cab or paying your bills. On the other side, I have applications like booking a holiday or shopping for a fashion item, which are more complex and require multiple steps.

So let's say you want to book a cab. On the right-hand side, we have a popular taxi-booking application: it shows nice visuals, it shows real-time information, and you can do pretty much everything with a swipe or a touch. On the left-hand side, I have a chatbot from the App Store. I ask the chatbot to book a cab. It tells me my location and asks what I am looking for: a hatchback or a sedan? If I select sedan, it tells me, sorry, sedans are not available, what is your preference? I give a preference, but it seems to have completely ignored it; it is still stuck asking me what I want to book. I am sure all of you have used taxi-booking applications, so you know the kind of experience you get when you can do things with a swipe, a touch, and you're done. That is what happens with these apps today: we have chatbots where we have to type things, that don't really understand what we say, that hide information from us, and that give delayed responses.

But developers have also built chatbots on Facebook Messenger. I managed to try one such chatbot that does fashion shopping. I asked, "What can you show me?" And it said, "I could not understand you, but here are a few tennis shoes." Then I said, "I do not want shoes."
And it said, "Sorry, I could not understand you, but here are some more shoes." Somehow I managed to type "tops", because I wanted to buy tops for my wife, and when I typed that, it gave me something. Then I said, "Give me something between 1500 and 2000," and again it did not understand what I wanted to say, although you folks can understand that I was trying to budget my shopping. So again, we have chatbots that lack intelligence, lack personalization, and cannot understand context. In my opinion, all of these are attempts to morph existing apps into chatbots, and in the process people are making simple, intuitive interfaces more complex and less efficient.

So for what kinds of use cases should we really build chatbots, and what does it take to build such chatbots? At our startup, we are building a fashion commerce application: a fashion shopping assistant. We actually sat down with more than 50 shoppers to understand the way they shop and the kind of experience they get when they go to a famous fashion commerce platform. Fashion is one domain where there is no single spec, unlike consumer electronics: there are lots of styles, and styles keep changing. Hence people open multiple tabs when they have to search for something. And many times people type into the search bar something like "I want to shop something for office wear", and the search bar doesn't really understand what to show for office wear. Even if you just type "office wear", you see results, but really surprising ones; I'll show you a few examples further on. And the interfaces we have don't take any kind of feedback from users: if you just want to say, "Hey, I like this, I don't like this," there is no mechanism for that. So the dilemma of shopping remains, because these online search interfaces don't really help you make decisions.

I feel that to go from that kind of experience to the experience of a real chatbot, we need four building blocks: (a) we should be able to extract intent from the text; (b) we should be able to provide results that are relevant to that intent; (c) we should have some kind of contextual interaction; and (d) we should make the experience personalized and help people finish a task. In this case, if I type "I want to shop evening party shirt in blue" (the spelling mistakes on the slide are deliberate, because that is how people actually type), it should understand that I am not just searching for any kind of shirt; I want to go to a party. Then the results I see don't include checked or striped shirts, because the bot understands that those are reserved for office wear. And it figures out that it should ask the user what color they are looking for, since they are going to a party, and then it asks whether it can personalize the budget. Makes sense?

I actually want to take a step back and tell you why building block number two is important. If I type a few search queries, say "evening party blue dress" on one of the famous portals in India, I don't see any dress in the results. It has somehow picked up "evening" and "blue" from some product, and these are the top three results that you get. And this is not a problem with just one commerce portal: if I type something like "office wear blue shirt for men", I get all sorts of results.
So here the products are not well tagged, because I am getting both men's and women's items. The products are not normalized; take color, for example: I am searching for blue, but there is only one blue item there. And again, the products don't have tags like "office". As I mentioned, we can build chatbots for such experiences with four building blocks, and these building blocks also apply to building chatbots for other domains. For example, if you want to build a chatbot to book a holiday, the bot should understand the number of days you want to go for and the place where you want to go. And if you want to go to Kerala, it asks which particular location you want to visit, and proceeds from there; I think you get the idea.

For the rest of my talk, I am going to focus on how we can apply deep learning to build the first two building blocks: extracting intent, and providing results that are relevant to that intent. For example, if I type "shop evening party shirt", intent extraction means identifying that there is an apparel, an occasion, a time, and a color present. Based on that, I do a search over a catalog that is very well tagged, and I provide results that are relevant to the occasion and the time.

I am going to go into some more detail now. I can break these two blocks down into three technical problems. First, we should be able to understand the text; I am going to take the instance of semantic spell correction, because that is the most common problem we have seen when building a bot ourselves. Second, I am going to talk about extracting entities from the text. This problem also applies to catalogs: e-commerce catalogs have a picture and a description, and many times the search engines don't pick up all the keywords in the description, so you also want to extract the different entities from that description to get a well-tagged catalog. Third, I am also going to make use of the product images to extract categories and attributes.

So the first technical problem is: from noisy, unstructured text, how should we generate a structured intent? I feel that to solve this problem, any text-processing block should have three properties: (a) it should be able to understand context; (b) it should be able to understand semantic similarity; and (c) it should be able to understand sequencing.

For example, take the query "office SHRT dress". Here "SHRT" is misspelled, and if I have to correct it, I have two choices: "SHORT" or "SHIRT". With simple edit distance alone, I don't know which one to correct to. But because I know that "office" is also present, I know I can correct it to "SHIRT", because "office shirt dress" makes more sense. Similarly, in the second example, a user has typed, "I want to shop for a sober shirt." If I have "formal" as an attribute in my catalog, I should be able to figure out that I had better map "sober" to "formal" and then give the user results that match a formal shirt. The third example is more complex: a user says, "I want to shop for an official dress that goes well with white jeans, preferably in red color." Now, with which should we associate "red": the jeans or the dress? If you just use word-based distance, you might associate it with the jeans.
But if you actually think about encoding context together with sequencing, you know that "red" belongs to the dress and not to the jeans.

We have heard several speakers this morning talk about deep learning being applied to images and to text. CNNs, for instance, have been shown to be very effective when you have to model context, because they pick up the morphological information present in words, and they can manage context well. For example, given a lot of data, CNNs can learn that "office" and "shirt" appear together a lot of times, and hence solve the earlier problem of correcting "SHRT" to "SHIRT". Word embeddings, again, have been shown to be effective at modeling semantic similarity. However, they have limitations: with word embeddings, as most of you might know, all the words you hope to see at test time have to be given to the embedding block at training time so that each has a feature vector. This is how word2vec works, and word2vec has in some cases not been shown to be that effective when you have to model different dialects of a language. Recently, there has been research on replacing word embeddings with character embeddings: you have a fixed-dimensional vector for each character, and when you get a word at runtime, you form a feature vector from its characters. That way you can handle new words, and you can also handle different dialects. And to model sequencing, LSTMs and bidirectional LSTMs have been shown to be effective; I will talk about them some more.

So we can combine these pieces and build a generic text-processing network. In this example, I have a phrase at the bottom that says, "show something for adventure trip," and I am going to process the words of this sentence one by one. Take the word "adventure": for it, I construct a D x (length of the word) dimensional matrix, similar to what you get when you have to process images. On this two-dimensional matrix, I apply W filters to form deeper feature maps, and then I apply the usual max pooling to form a feature vector. So now I have a feature vector for the word "adventure". I then give this feature vector to something called a highway network; I can talk about them offline, but highway networks have been shown to be effective at carrying forward some of the context in the input feature vectors to the output vectors, so they essentially model morphology and context better when trained with CNNs. After the highway network, I have a modified feature vector, which I give to an LSTM. The LSTM takes these feature vectors one by one and predicts a word: the word to which I should correct this spelling. The output size of the LSTM is the length of your vocabulary, because you want to predict a confidence score for each word in your vocabulary. To this network I can now give a bunch of pairs of misspelled sentences and their corrected versions, and that is how I train it. We used TensorFlow to build and train this network, on AWS GPU instances.
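As an aside, here is a minimal sketch of how such a character-CNN plus highway plus LSTM spell corrector could be wired up in TensorFlow/Keras. This is an illustrative skeleton under assumptions, not our actual code: the sizes are toy values, a single convolution width and a single LSTM layer stand in for the larger configuration I describe next, and the Highway layer is a simple hand-rolled implementation.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Toy sizes for illustration only.
    NUM_CHARS, VOCAB = 62, 6500        # unique characters / unique words
    SENT_LEN, WORD_LEN = 20, 16        # words per sentence, chars per word
    CHAR_DIM, FILTERS, UNITS = 20, 100, 400

    class Highway(layers.Layer):
        """One highway layer: y = t * h(x) + (1 - t) * x."""
        def build(self, input_shape):
            d = int(input_shape[-1])
            self.h = layers.Dense(d, activation="relu")
            self.t = layers.Dense(d, activation="sigmoid")
        def call(self, x):
            t = self.t(x)
            return t * self.h(x) + (1.0 - t) * x

    # Input: a sentence as a (words x characters) grid of character ids.
    chars = layers.Input(shape=(SENT_LEN, WORD_LEN), dtype="int32")

    # Character embedding, then a per-word CNN with max pooling over the
    # characters, giving one feature vector per word.
    x = layers.Embedding(NUM_CHARS, CHAR_DIM)(chars)
    x = layers.TimeDistributed(
        layers.Conv1D(FILTERS, kernel_size=3, activation="tanh"))(x)
    word_vecs = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)

    # Highway layers carry character-level context forward.
    h = Highway()(Highway()(word_vecs))

    # Word-level LSTM; at each position, predict the corrected word as a
    # distribution over the whole vocabulary.
    seq = layers.LSTM(UNITS, return_sequences=True)(h)
    probs = layers.Dense(VOCAB, activation="softmax")(seq)

    model = tf.keras.Model(chars, probs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

Training then amounts to calling model.fit with misspelled sentences encoded as character grids and the correctly spelled sentences encoded as word ids.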
To get raw data, we actually scraped more than 50K fashion blogs and social media streams, because you want a mix of data with correct and incorrect spellings. We also induced some generic transformations: there have been studies on how people type on phones, and using those, you can induce realistic misspellings. We also took data from lots of product portals. In total, we had about 30,000 pairs of misspelled and correctly spelled sentences, and some of these we went and corrected manually, just to ensure that the automated data processing was working fine.

For the network I showed earlier, we had a character embedding of length 20; we applied 1,200 filters in the CNN; we had two layers of highway network and three layers of LSTM; and the number of hidden units in the LSTM was about 400. We had a total of 62 unique characters and 6,500 unique words in our data set, and the total number of tokens, that is, all the words in our training set, was more than 5 million.

We used two metrics to evaluate this network. The first is perplexity, which is e raised to (the negative log-likelihood divided by the number of tokens); it essentially normalizes the loss over the tokens. This is a number that can go from 0 up to, let's say, 200, and the lower the number, the better your network. We also used the F1 score. We experimented with a word-based CNN plus LSTM, and we also tried removing the highway network, just to see what kind of performance we would get. As you can see, the character-based embedding plus highway network plus LSTM gets the best performance: the perplexity is, in fact, more than 35% lower, and the F1 score improved by 10%. We also played with the sizing of the character embedding and of the LSTM. When we kept D, the length of the character vector, at 10, and the number of LSTM hidden units, M, at 200, the performance was not good. After a lot of experimentation, we felt that D = 20 and M = 400 works well.

We also used what these networks learned to evaluate whether, if a user types something we don't know, we are able to extract semantically close words. We gave an in-vocabulary word, "complement", as input, and asked, using cosine similarity, for the top five closest words. For the word-CNN-plus-LSTM network, we received "complementary" and "matching", so the network has learned some semantics. For the character-based embedding plus CNN plus highway network plus LSTM, we received "suitable", "matches", and "tee". This shows the power of modeling text starting from characters. We also gave a word that was not in our vocabulary: "sky blue polka", a pattern you have in the fashion sector. Because word2vec cannot have a vector for a word it has not seen, it cannot return anything semantically close to this word. But our network gave lots of interesting words that are actually semantically very close.

Now I am going to talk about the second technical problem: given that you understand the intent, you want a very well-tagged catalog. The problem here is that you have lots of product image and text data from different portals, and you want to normalize it: you want a single schema so that you can do a good search over it.
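To make "a single schema" concrete, here is a small, hypothetical example of what one normalized catalog record could look like once attributes have been pulled out of both the image and the description; the field names and values are made up for illustration.

    # A hypothetical normalized catalog record. Attributes extracted from
    # the product image and from the text description land in one schema,
    # so search can filter on any of them. All field names are illustrative.
    product = {
        "id": "sku-12345",
        "category": "dress",
        "brand": "Zara",
        "gender": "women",               # avoids mixing men's and women's items
        "attributes": {
            "color": "blue",             # normalized, e.g. "sky blue" -> "blue"
            "pattern": "polka dots",
            "length": "midi",            # extracted from the description
            "sleeves": "full",           # full sleeves, not a full neck
            "occasion": ["office wear"], # so "office wear" queries work
        },
        "source": "portal-7",            # which portal the raw listing came from
    }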
So this actually means that we have to extract attributes from the images and attributes from the text. If you look at the image, the crop shape is something that is not present in the text, but it is there in the image. Similarly "pleated", which is a type of skirt: the word "pleated" is not in the text, but it is in the image. In the text, on the other hand, we have the brand, so if somebody searches for a particular brand, we should be able to give out results. The technical problem here is: how do you extract categories and attributes from the images, and how do you extract entities, their types, and their relationships from the text?

I am now going to extend the network I explained earlier to solve the problem of extracting entities and their types from the text. We have a similar architecture of CNN plus highway network, but now I have a bidirectional LSTM: the LSTM not only looks at the history, it also looks at the future. I take the output feature vectors from these two LSTMs and add a CRF layer, and the output I get is the type of each entity. For the sentence "red zara shirt dress", I get color, brand, shape, and apparel. Again, you can form pairs with a paragraph on one side and its entities on the other, and train this network; a minimal sketch of this tagger appears at the end of this section. After training, we applied it to some complex text descriptions, real descriptions that you find on the product portals. If you look at the sentence "this midi with sleeves as full", it has identified that "midi" actually corresponds to length, and that "full" is not a full neck but full sleeves. If you had to write a text parser to parse these kinds of sentences, you would end up with a lot of hand-written rules.

Let me also touch on the way we extract the different categories and attributes from the images. Some of you might be familiar with segmentation-based fully convolutional networks; these were proposed last year, and they have also been used in self-driving cars. What we do here is take images from product portals and segment them, essentially assigning a label to each pixel of the image, and that is the input to the fully convolutional network. The goal, in the end, is to reproduce these labels, and in the process the network learns some low-level and high-level features. Here is an example from our network: if you give this image as input, that is the output you get at test time. Using this output, you have separated the background from the foreground, and now you can also identify different attributes and categories in the image.

To summarize, I talked about four building blocks that I feel anybody has to build to create a great chatbot-based experience. Actually, I like to refer to it as interactive search rather than calling it a chatbot. And I feel that if we focus on solving some complex search aspects, and on building interactive search for a specific use case, we will be able to build great consumer-experience tools. Equally important is the role of deep learning, where you have to parse text, speech, or images at the input layer, and also parse the text and images in the information you collect at the back end, to provide results for the kind of search queries you get at runtime.
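Since the bidirectional-LSTM-plus-CRF tagger is the piece that turns raw descriptions into typed entities, here is the minimal sketch promised above. It assumes TensorFlow with the tensorflow_addons package for the CRF utilities, uses a plain word embedding where our stack uses the character CNN and highway layers, and uses toy sizes and a toy tag set throughout.

    import tensorflow as tf
    import tensorflow_addons as tfa  # CRF utilities live here
    from tensorflow.keras import layers

    NUM_TAGS = 5                     # e.g. color, brand, shape, apparel, other
    VOCAB, DIM, UNITS, SENT_LEN = 6500, 100, 200, 20

    # Emission scores from a bidirectional LSTM over word vectors.
    words = layers.Input(shape=(SENT_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB, DIM)(words)
    x = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))(x)
    emissions = layers.Dense(NUM_TAGS)(x)    # per-token tag scores
    model = tf.keras.Model(words, emissions)

    # The CRF layer on top scores whole tag sequences rather than tokens
    # in isolation, via a learned tag-transition matrix.
    transitions = tf.Variable(tf.random.normal([NUM_TAGS, NUM_TAGS]))

    def crf_loss(gold_tags, emissions, lengths):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        ll, _ = tfa.text.crf_log_likelihood(
            emissions, gold_tags, lengths, transitions)
        return -tf.reduce_mean(ll)

    def decode(emissions, lengths):
        # Viterbi decoding: the best tag sequence, e.g. for
        # "red zara shirt dress" -> color, brand, shape, apparel.
        tags, _ = tfa.text.crf_decode(emissions, transitions, lengths)
        return tags

Training would compute emissions with the model, apply crf_loss inside a gradient tape, and update both the model weights and the transition matrix.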
The speakers earlier today talked about the effort with which researchers have figured out a lot of these complex issues. So this is just a token of appreciation to the people who work hard on research; we, the startups, make use of that research to build something useful. I am also thankful to IBM and AWS, who have been very helpful to startups in providing compute so that we can experiment, survive, and build experiences for consumers. And of course NVIDIA. So that's it from my side. If you have any questions, I can take them now.