So we have our next speaker here, Niranth. He's going to talk about how we are building smart replies for chat. Let's give him a big round of applause.

Thank you. Hello. I want to discuss what we are trying to do and how we are trying to do it. This is basically a system design talk, or a talk about how we think about formulating a machine learning problem, specifically from the point of view of a product. I have three objectives today: what problem we are solving, what machine learning challenges we have and how we tackle them with the Python ecosystem, and to use a lot of funny gifs. Before we get started, most of what we are discussing is work in progress. The important ideas and concepts you should know are in green. Python libraries which we use to build our tooling are in black and bold. That's what you should pay most attention to; you can ignore most of the slides if you're paying attention to those things. The slides link is already up, so you don't have to worry about that either. Just open the slides and you can follow along later. Listen to me now.

What are we trying to do? At the bottom right of the screen, you'll see what an agent sees when they use a chat interface to respond to user messages. We have two ways to do it: canned replies, where you use slash commands like you do in Slack or something similar, and smart replies, which you are familiar with from the Gmail interface, where the suggestions are already pre-loaded. We are trying to build smart replies, hence "smart replies for chat".

Why are we doing this? Well, I work for a company called Verloop. We make conversational software and help with chat automation. Our thinking behind this paradigm is that we're not trying to replace humans. We're trying to build tools so that we can assist them, guide them, make them more efficient, and basically make it easier for people to help each other. That's the idea behind this.

Before we continue, it's important to look at the work from Gmail which inspired this feature: how they built it, what their considerations were, and what kind of resources they had. In terms of model design, or how you ship a feature, email gives you a lot of compute budget. Email smart replies don't have to be instantaneous, so latency can be in the order of a few seconds. You have a lot of time compared to chat, where you're typing in real time or pseudo real time. Throughput: Gmail processes millions of emails per minute, and those numbers are in the original paper. The original smart reply paper had 11 co-authors. That's a really large number; these are people doing just machine learning for this one feature. They have a really, really large number of resources deployed to this one feature for Gmail. Serving cost: in any modern cloud deployment, if you're running a software service company, you care about your dollar spend per feature, per service, especially if you're serving deep learning models at scale and at high throughput. GCP, and Google in general, have direct access to TPU (tensor processing unit) deployments, which are much more cost effective than traditional GPU deployments.

Now, how does Gmail process this? This is the diagram which, if it's the one thing you take away from this talk, will make or break it. Pay attention to this. This is your takeaway.
Gmail processes the email and decides whether to trigger a response or not. At that point, they already have a curated set, a database of permitted responses, which they would want to use, and they select from it; that is what they call the response selection block. This response selection output then goes through diversity selection, which makes sure that the three messages you typically see in Gmail are different from each other and not all saying the same thing. It is then served to your interface, which is usually a browser or an app.

What is an LSTM? This is a core component of how any sort of response selection is done. To give a very short overview, an LSTM is one way of capturing meaning from text using some form of memory, so that you can learn sequences, like text, and capture their meaning.

The other really interesting thing they do is response generation: the permitted-response DB they select from is itself generated. Response generation, or any sort of text generation, is an incredibly complex and hard problem which we as humans have not made a lot of progress on, except of course for Google, because they can. So, as a trade-off, Gmail suggests really, really short responses, because they're generating them. This is not ideal for people working in chat, because most of your efficiency comes from being able to suggest long responses, which is what saves typing effort. So the trade-off differs. In Gmail it is perfectly fine, because email is anyway supposed to be long-form; it is not designed just to save effort and that is not the metric they optimize for. The optimization metric in Gmail is basically how much the suggestion gets edited.

How do you do response generation? The first step is canonicalization. That's a fancy way of saying that messages which use very similar words or build on the same idea should be clubbed together, so it's essentially a clustering mechanism based on linguistics. In this example, "Thanks for your kind update", "Thank you for updating", and "Thanks for the status update" are all canonicalized into the same format. The other really interesting idea Gmail builds on when suggesting smart replies is a graph which captures message and response pairs. For instance, a lot of emails can be as short as "let us get together soon", and the replies are "when should we meet?" or "let's meet on Friday at nine". Those pairs form nodes in the graph.

What tools or ideas are used to build these, the canonicalization component and the graph generation component? The main idea comes from linguistics and is called dependency parsing. It's an old idea, something we do internally at Verloop using spaCy, which we heard about from Ines earlier today. You can also do this with Stanford NLP. The basic idea is that we break down a sentence. For instance, the example here is "PyCon will be in Chennai", which breaks down into a subject, an auxiliary verb, the root verb, which is "be" here, a preposition, which is "in", and an object, the proper noun "Chennai". This core information, the part-of-speech tags and how the words relate to and depend on each other, is what we call dependency parsing.
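To make that concrete, here is a minimal sketch of a dependency parse with spaCy. It assumes the small English model, en_core_web_sm, has been installed; the printed labels roughly match the breakdown described above.

```python
# Minimal dependency-parsing sketch with spaCy.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("PyCon will be in Chennai")

for token in doc:
    # token.pos_  -> part-of-speech tag
    # token.dep_  -> dependency label
    # token.head  -> the word this token depends on
    print(f"{token.text:<8} {token.pos_:<6} {token.dep_:<8} head={token.head.text}")

# Roughly: PyCon -> nsubj, will -> aux, be -> ROOT, in -> prep, Chennai -> pobj
```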
How are we building this? Before we discuss that, it's important to understand what constraints we honour. Latency: Gmail, or any email processing, has a lot of compute time available. Our compute time has to be under 500 milliseconds. People coming from ad serving or similar deployments operate on latency budgets of under 50 milliseconds; ours is ten times that, a large budget, because we're trying to do something complex, but it is still small compared to, say, an email processing system or traditional computer vision deployments in deep learning. Throughput: the number of times we need to suggest a smart reply is also really small compared to Gmail. What I'm trying to highlight here is that horizontal scaling is not a primary challenge for us when deploying a model, partly because Kubernetes lets us scale horizontally and partly because the throughput we need per machine is limited to around, say, 2,000 conversations per minute. The most interesting constraint for us is developer time. We had one developer, which is me, doing this over a short window of eight weeks: figuring out how to scope it, how to build it, and then how to productionize it. This had to be done in a very constrained eight-week cycle while still keeping in mind what our serving costs might be. We use CPU inference, partly because we have learned with time that cloud GPUs are not always available at peak load and it's hard to add machines on very short notice, especially when your peak is very different from your normal distribution, and hence we rely more heavily on CPU inference.

The diagram we saw from Gmail: these are the changes we made to it. The first is that we always trigger a response. A person using the interface can ignore it; that's perfectly fine, in fact we encourage it. The other thing we decided, in the interest of saving developer time, was not to do diversity selection. Those are the two main changes in how the service is designed and structured. The other important change is how we get the cluster of responses we allow to be served. Instead of generating them, which is a much more complex problem to solve, we use previous messages from the same client, the same agent-interface users, to suggest what they might want to say. This gives us a much smaller search space and reduces our problem to a selection problem; we bypass a very large fraction of the complexity that comes from generating text. The downside is that we can now only reply with exact messages which have already been typed. So these are the three decisions we made: we always trigger a smart reply suggestion, there is no diversity selection, and under the hood we do response selection, not generation.

What is happening under the hood? What is the design here? This is the main paradigm we use, and in fact, in our opinion, most small systems doing any sort of machine learning should be built this way. The paradigm is very simple: retrieve, rank, and then select. This also allows us to leverage modern advances in information retrieval. The idea is that we first retrieve the conversations most similar to the conversation happening now, then re-rank to select a shorter list of conversations, get the top message and response pairs, and then just select the top three.
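As a rough illustration of that shape, here is a toy, self-contained sketch of retrieve, rank and select. In production, retrieval is Elasticsearch/BM25 and ranking uses learned embeddings; plain token overlap stands in for both here, purely to show the three stages, and the names below are illustrative, not our actual code.

```python
# Toy sketch of the retrieve -> rank -> select paradigm (not production code).

def overlap(a: str, b: str) -> float:
    """Cheap similarity: fraction of shared tokens between two messages."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def suggest_replies(current_message: str, history: list, top_k: int = 3) -> list:
    # 1. Retrieve: keep the past message/response pairs most similar to the
    #    current message (in production: 500-1,000 whole conversations).
    retrieved = sorted(history,
                       key=lambda p: overlap(current_message, p["message"]),
                       reverse=True)[:1000]

    # 2. Rank: de-duplicate and score the retrieved message/response pairs.
    seen, ranked = set(), []
    for pair in retrieved:
        key = (pair["message"], pair["response"])
        if key not in seen:
            seen.add(key)
            ranked.append((overlap(current_message, pair["message"]), pair["response"]))
    ranked.sort(reverse=True)

    # 3. Select: return only the top few responses to show to the agent.
    return [response for _score, response in ranked[:top_k]]

history = [
    {"message": "where is my order?", "response": "Let me check your order status."},
    {"message": "hi", "response": "Hello, how can I help you?"},
]
print(suggest_replies("where is my order", history))
```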
So what we do is retrieve a larger set, re-rank it to get a more refined set, and then select the top three. The numbers look something like this: first we retrieve something like 500 to 1,000 conversations, then we select 10 to 1,000 unique message-response pairs, and we use the conversation as a context vector, a context variable, which is used to improve our ranking within those 500 to 1,000 conversations. The last step is to select the three top responses from these pairs. This is why I said don't pay too much attention to the slides; look at what is in green and orange.

Retrieve: standard Elasticsearch implementations will do really well for this. For instance, TF-IDF would be a really good starting point. In our case, because we work with shorter message lengths, usually less than 10 words, we use BM25 Okapi, which is slightly more sensitive to rarer words, and that is important in our context; but TF-IDF is still a great starting point.

Ranking is where you want to worry and pay more attention. Here we use ideas from semantic similarity. We use vectors from, say, the Universal Sentence Encoder from Google, which is shipped out of the box; we use pre-trained vectors to seed or train it. We also use simple text-based features, for instance shingles, which is a fancy way of saying long character sequences: a word like Niranth, which is my name, would be broken into N-I-R-A-N-T, and that would be one shingle. The last is InferSent, which is another vectorization method from Facebook. We combine these with a cosine similarity, a Jaccard similarity, or a simple dot product; we treat the similarity metrics, cosine, Jaccard and dot product, as hyperparameters, and these are tuned per deployment.

The last step is: how do you select? At this point you have the conversations, you have ranked them, and now you're going to select the top three responses; you already have message and response pairs. Here we use bi-LSTM architectures. We discussed LSTMs at the beginning; it's a very fancy way of saying that we want to capture long sequences, and this is where the conversation context helps us, as well as the message and response context. We treat it with a standard AWD-LSTM, which is pre-built and well-written in fastai. That forms the backbone of how we select our messages.

To recap: retrieval is done with BM25 Okapi; anything from Elasticsearch, including TF-IDF, is also a great starting point. For ranking, use any good vectorization technique that captures word and sentence embeddings, for instance Universal Sentence Encoder, InferSent, or an average of GloVe vectors; those are great starting points (a small sketch of this retrieve-and-rank combination follows below). How do you select? Look at longer context, look at the entire conversation context, and use longer, deeper models here. You can go with transformers, which Matt discussed yesterday, or use something from fastai, which we use, called AWD-LSTM.

What did we learn as we did this? Very simply put: if you have a large training corpus to begin with, you do not need to worry about the retrieve and rank steps and you can do selection directly using graph-based deep learning models. On the other hand, if you have really small data, which we do, we operate on the scale of a few hundred thousand conversations, it is much more useful to do the ranking and retrieval steps first. It reduces your search space and you can use pre-trained models a lot more.
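To make the retrieve-and-rank step concrete, here is a small sketch assuming the rank_bm25 package: BM25 Okapi retrieves candidate messages, and cosine similarity over sentence vectors re-ranks them. The embed() function is a placeholder for whichever sentence encoder you use (Universal Sentence Encoder, averaged fastText, InferSent, and so on).

```python
# Sketch: BM25 Okapi retrieval + cosine re-ranking over sentence vectors.
import numpy as np
from rank_bm25 import BM25Okapi

corpus = [
    "thanks for the update",
    "where is my order",
    "please share your order id",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "what is my order status"
bm25_scores = bm25.get_scores(query.split())
candidates = np.argsort(bm25_scores)[::-1][:2]      # retrieve: top-N by BM25

def embed(text: str) -> np.ndarray:
    # Placeholder vectorizer so the sketch runs; swap in a real sentence encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

query_vec = embed(query)
reranked = sorted(candidates,
                  key=lambda i: cosine(query_vec, embed(corpus[i])),
                  reverse=True)                     # rank: best candidates first
print([corpus[i] for i in reranked])
```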
How do we deal with vectorization? We look at vectorization at three zoom levels: word level, sentence level, and a larger context, which is, say, a conversation. For the word level, we use pymagnitude, which lets us do basic string hashing; it's quite similar to Gensim, for those who are familiar with that, and we use any sort of GloVe or word2vec vectors. What this does is map UberX, UberXL and Uber very close to each other, even if UberXL is not in the dictionary you pre-trained on. fastText is what we use when building sentence embeddings: we use averaged fastText, or a small attention layer over fastText, to capture sentences. In production we switched to the Universal Sentence Encoder for short sentences, especially because it captures the differences between questions, assertions and statements quite well. The last level, longer context, is probably captured better with pre-trained language models. We use XLM for this, which is a modified version of BERT: better hyperparameters, better pre-training, less noisy in terms of the character set and dictionaries you operate on, and more specialized for English, which is what we operate in. It's easier to serve and easier to fine-tune, and hence we decided to go with XLM. This can obviously change with time as more transformers come into play, and we should have ways to bias pre-trained transformers in the future, like the new work from Salesforce called CTRL. That should make it much easier to add domain specificity, for instance fashion or e-commerce, to language models directly.

How do you evaluate a system like this, built around smart replies? We look at exact matches, where whatever we suggested was clicked and sent as-is; how often it was changed, which we usually measure as the fraction of tokens edited after tokenization; and the last is adoption: how many users used this in at least 10% of their conversations. That is what we try to measure. For visuals, this is usually me when we're discussing metrics with the product manager.

What are the mistakes we made as we tried to do this? Privacy has been a fundamental roadblock in taking this to production and deploying it more widely. Names and addresses, when we are training on conversations, tend to leak very frequently, especially in e-commerce support, which is something we specialize in. We are really good at mining out names: we use Flair from Zalando Research, and we get a reasonable F1 for that, around 0.85. We have an internal fork which works well for Indian names, and that does the job. If you are trying to build this, PyTorch has an official tutorial for character-based recurrent models; that's a really good starting point and a great baseline. Just to highlight, spaCy also has a built-in model for named-entity recognition, but it's optimized for longer text, not one-line chat messages, and that's a very different problem because it leverages the surrounding context; that is why we switched to something like Flair.

Diversity selection. I really should have done this; it was a really bad call. What ended up happening is that, because we no longer had diversity selection, we ended up suggesting responses like these: "Hello, how can I help you?", "What can I do for you?", "What can I do for you?". Again, this is really bad. I could have learned from what Gmail did. I'm really an idiot.
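For illustration, here is a minimal sketch of the kind of diversity filter that would have avoided this, using token-level Jaccard similarity as a cheap proxy for "saying the same thing". It's a sketch of the idea, not what Gmail or Verloop actually ships.

```python
# Minimal diversity filter: drop suggestions that are near-duplicates of ones
# already kept, walking down the ranked list best-first.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(1, len(sa | sb))

def diversify(ranked_responses, max_suggestions=3, threshold=0.6):
    selected = []
    for response in ranked_responses:              # assumed sorted best-first
        if all(jaccard(response, kept) < threshold for kept in selected):
            selected.append(response)
        if len(selected) == max_suggestions:
            break
    return selected

print(diversify([
    "Hello, how can I help you?",
    "What can I do for you?",
    "What can I do for you?",          # near-duplicate, gets filtered out
    "Could you share your order ID, please?",
]))
```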
To recap, this is what Gmail did: the permitted responses are generated. We instead use a selection mechanism based on retrieve and rank to get candidates in place, then use those to pick the top three candidates, and then we would want to apply a diversity filter over the selected candidates to arrive at the final top three. So the very, very simplified journey of how a message travels through Verloop's smart reply looks something like this. That's a wrap. Do we have time for questions?

Yeah, we have time for questions. So, questions, just a second. Over there, the middle row, yeah.

Really interesting talk, first of all. My question is: in my work I'm building something similar. Say someone calls and is trying to reach someone, and I'm asking them questions for more specific details. What I'm interested in knowing is: when someone asks a question, how much detail do you give in the first response? And if that's not what they're looking for, how do you suggest other content? Say a client is asking for some very generic bank account details and you give them a link, and they say, no, something else. How do you keep that conversation going so that it can be settled without agent interference, so the agent does not have to manually do the task of searching for the information?

Okay, just to recap the question for everyone and confirm I understood it: the challenge is that you have a longer conversation, you have already shared some information as a bot, an automated interface, and the user is not happy with it; they're looking for something else and you have confirmation of that. Now how do you keep them engaged? That's a very open challenge. I can share what we try to do: we model it as a slots-and-intents problem, where we say, hey, this is the intent and these are the slots that go with it. "Slots" is a fancy way of saying: this is the kind of information we require to answer this question, and this is typically the answer they're looking for if they have given only this much information, and we model it around that. That is what we are aiming for. Most of that is still work in progress; it's not deployed across all customers because of data challenges. But the idea is: you take the slots and say, okay, you did not want this, are you looking for X, Y or Z? That's one way, selecting from X, Y, Z. The other is: I already have certain slots filled, so maybe this other response is possible around that. So there are two ways to look at it, and it's more of a conversation design challenge than just an ML challenge.

Hello. Hey. Have you done any experiments on Indian languages using these technologies? So, I built a very early language model for Hindi and later extended it to Tamil. There's a lot of interesting work on our GitHub repository called iNLTK. There are language models around this, and I believe there's another talk on Indic NLP today at PyCon as well.
We have tried to productionize these, but that's more of a commercial challenge: people don't have enough demand for Hindi and other Indian languages yet. The technical challenge there is mostly around standard NLP tasks like dependency parsing and POS tagging; we don't have enough data in standard datasets like Universal Dependencies for this.

So how do models like BERT and XLM perform on Indian languages? Really bad. His question was how bad the standard pre-trained multilingual models are for Indian languages. They're really awful; don't trust them. Thank you.

Hi, I have a question here. I saw you use PyTorch. Why not another framework like TensorFlow? Why did you choose this one in particular? Are you just more comfortable with PyTorch? So, we wanted to try out transformers, and transformers as a language modelling technique were easier to consume via PyTorch. I'm almost equally comfortable with PyTorch and TensorFlow. I like PyTorch more because it's more extensible; as was discussed in this morning's talk as well, the API is not trying to do everything in one go. PyTorch just fits better with my thinking habits and the way I want my tools to work, and that's why I went with it, for those two reasons. But we are not loyal to either framework. For instance, the Universal Sentence Encoder implementation is served via TensorFlow Hub. So we're not loyal to either framework; that's kind of irrelevant to us.

Hi. Were there any difficulties in deploying with Python because of performance issues or something like that? And a second question: how much of a difference does the last selection model make in your metrics? You've used a deep learning model at the end, beyond the sentence encoder in the middle. So how much of a difference does it make going from TF-IDF to the second part, and how much does the last part contribute, in percentages?

Okay, a two-part question. Is Python too slow for us? Actually, it's quite fast. We're happy: we're able to do it well under 500 milliseconds, and our median response time is closer to 200 milliseconds, which we're happy with. So Python is fast enough for us. Second, how much of a lift did we get from using deep learning methods? I would say close to 15% to 20%. I want to highlight that these are problems where 15% to 20% is the difference between night and day, because the difference between 99% and 99.9% is again night and day. Language systems are very, very brittle, and when you're dealing with a vocabulary of a few hundred thousand words, every precision point corresponds to a lot of words, so it's really important to have as much precision as possible. The 15% boost from deep learning was worth the extra 100 milliseconds we spend on it.

Hi. Hello. Hey. How do you handle the relevance feedback when you serve a reply? If that reply doesn't fit the conversation, how do you handle that in the initial deployment? What cycle do you follow for refining the responses? Okay, if I understand your question, it is: how do we capture feedback, and how do we incorporate that feedback into the ML? Once you push to production, you won't be asking end users for relevance feedback, but for the initial deployment you need that cycle to refine things. So how do you handle relevance feedback in your framework?
I guess if we could turn the slides back on... we have a slide where we discuss evaluation metrics, and they also capture our feedback signal very well. An exact match says that whatever I suggested was brilliant out of the box, so I don't have to worry about it. Percentage change means I was somewhat right, and it gives me an estimate of how somewhat that was. And adoption rate tells me that if I'm completely off the mark, it just won't get used. So these are the metrics we look at to understand whether it is working in production. This is captured implicitly, and that's a product design challenge rather than an ML or engineering challenge. Thank you.

We have time for one more question. Over there. Third row. Last row. There? Hi. Hello. Yes, hi. So, that was a great talk, and you have a great product. My question is: what about domain-specific conversations? How do you handle those? Can you give an example? How do you define a domain? Like a domain: there is a startup working in the education industry, a training institute, so the conversation will be based on that. But if there is a hospital, the conversation will be about medical issues and all that. Okay. So we use the fastText embeddings which we mentioned, and we combine them with the Universal Sentence Encoder vectorization. fastText is expected to capture domain-specific vocabulary because it uses subword embeddings, and we also fine-tune it on our own internal corpora. So that captures our domain-specific vocabulary embeddings and, if you are lucky, some degree of semantics as well. In practice it doesn't capture much of the semantics, but it captures the vocabulary and embeddings to a reasonable degree.

Okay, I think we are right on time, so that was the last question. Cool. I'm happy to answer all questions offline as well. So he'll be available offline, and the others can just reach out to him. Thank you.