So this is work we did at Intuit, and it was accepted as a paper at this year's NAACL. It is joint work with Vishwa, who is here today, and with my manager. With that, I think we can get started. Here is how I'll cover it: we'll talk about the problem context, then some exploratory analysis and previous work that people have done. Then we'll go to what the business wants, because in a lot of ML applications that is more important than just the theory, as the previous speaker also mentioned. Then we'll get to the key idea, the results, some learnings, and Q&A.

A few things about Intuit. Intuit is a 30-year-old company based primarily out of the US, and roughly 50% of the people in the US who pay taxes file them using Intuit products; TurboTax is used by a lot of people. And this is tax season in India, so that resonates with me. Customers interact with Intuit through several channels: telephone, web chat, and even chatbots. You need to understand that when customers come to Intuit, they come with a very different mindset. It's not like doing a Google search, because compliance is not a luxury: you can land in jail if you file your taxes wrong. So the questions people come to us with are very complex, and it also means we need to give correct answers. It's not okay to give an approximate answer, because if you tell me the wrong tax bracket and I file my taxes using your answer, the IRS will probably come knocking on my door the next day. I'm just highlighting the importance of customer questions, and of how accurately we answer them, especially in this domain.

So even before trying to build anything, we said, let's look at the kind of questions that come to us. This classification is just one way of splitting things; there could be multiple ways. But at a broad level, you have two categories of questions. What typically happens in a chat with a customer? In the initial phase, the care agent greets the customer: hey, good morning, how was your day, and so on. In the US, customers typically exchange pleasantries back, and sometimes we even see customers chatting with care agents at a personal level. So towards the beginning of the chat you have these meet-and-greet conversations, in the middle of the chat you typically have the core problem the customer wants to talk about, and towards the end of the chat you again have meet-and-greet: if you solved the customer's problem they will thank you, and if you didn't, they'll probably bash you and say they are going to complain to your manager. So there is a lot of meet-and-greet or general conversation at the beginning and end of the chat, and in the middle you have the core domain-related content. So the first-level split is whether a question is domain-specific or non-domain-specific, which is general chat. If you go into domain-specific questions, you can again split them into two categories.
The first category is where the answer exists. When I say exists, I mean the answer exists in our database or knowledge base. We are a fairly old company, so we have an exhaustive knowledge base, and most of the time the answer is found there. If it is there, you just retrieve it and show it to the customer. But there are still cases where the answer needs to be created, because there might not be a ready-made answer to the customer's question. The green boxes just highlight the different approaches that can be used. If the answer exists but cannot be found, the way to solve it is semantic search, because you need to understand the meaning of the question, and then it is essentially a search or retrieval problem: you already know the answer is somewhere in there. It's like finding a needle in a haystack, but you know the needle is there; you don't have to bring it in from outside. For the second box, where the answer needs to be created, you need some reasoning over knowledge bases, because now you don't have a ready-made answer and have to actually generate one using some logic. You might have to go to different sources, do a Google search, go to the government website. When I say "you", this is how human agents typically solve these problems, so the algorithm has to do similar things.

So let's look at an example of why the answer might exist but cannot be retrieved. This is an example customer query: "How to receive payment in Bosnia and Herzegovina convertible mark", with a spelling mistake in it. If I use just keyword-based search, looking for the exact words in the documents, I am not going to retrieve the correct document, because those misspelled words will not exist in any document. And honestly, this example is conservative: if I typed this myself, I would probably make mistakes in every word; maybe not Bosnia, but the second word is beyond my abilities. If you look at it from a semantic perspective, though, all the customer is asking, assuming they are US based, is how to get paid in a foreign currency. That is the real meaning of what the customer wants, so ideally you should surface that answer, and this answer exists in your database. This is a real example; this really is the answer the customer is looking for. Maybe they will then drill down from foreign currency to that particular currency. This slide is just to motivate why this can happen, why you would not be able to retrieve the answer even though it exists. That is what this talk is about. I laid out the entire landscape of the different types of questions, but for today we will focus on this particular sub-problem: retrieving answers that exist in your database but cannot be found using keyword-based search.
So just look at the last bullet on this slide; I think that's the most important. Two things are required here. First, you need to be able to retrieve synonyms, words with a similar meaning, so there is a notion of semantics. Second, the retrieval should be robust to aberrations. For example, if I type "euro", then "dollar" might be a very similar word because both are currencies. But our problem is slightly more complicated, because maybe I am not typing EURO, I am typing EZRO. So you should not only be able to retrieve synonyms, you should be able to retrieve them when there are misspellings or aberrations in the original word.

Let's quickly look at what other people have done; this is not a new problem. There are three classes of approaches. In the first, you create word embeddings for the query using something like word2vec or GloVe; there are lots of ways of creating embeddings. By word embeddings I mean embeddings of the entire word: you create an embedding of "euro", an embedding of "dollar", and most likely they will be close together in the embedding space. The pro of these approaches is that they are fast, because you are just doing a lookup, and with an approximate lookup you can get maybe a hundred milliseconds of latency. The con is that you cannot handle out-of-vocabulary words: if I misspell "euro", that misspelled word is not even present in your vocabulary, so there is no way to handle misspellings, or a word hyphenated the wrong way, and so on. The second class of approaches directly modifies the incoming query. Think of it as a machine that takes an incoming query and spits out a modified version of it, modified in such a way that it can be suitably used for retrieval. Here you can use standard RNN sequence-to-sequence models: you feed in the input sentence, treat it as a translation problem, and on the output side you get a corrected version of the query. The problem with those approaches is that at prediction time you are pushing the query through an ML model in real time, which is expensive; at least for RNNs you have complexity linear in the number of steps. The third approach is to do away with search completely. You pre-create embeddings of your answers, either by pushing whole sentences through RNNs or by taking word embeddings and aggregating them. Then when you get a question, you create the embedding of the question and do a nearest-neighbor lookup. Again, the problem is that you still have to create the embedding of the question in real time.
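As a rough illustration of that third class of approaches, here is a minimal sketch: pre-compute one embedding per answer by averaging word vectors, then answer a query with a cosine nearest-neighbor lookup. The toy vocabulary, the random stand-in vectors, and the simple averaging are all illustrative assumptions, not the setup of any particular production system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for pre-trained word vectors (word2vec/GloVe would go here).
vocab = ["how", "to", "get", "paid", "in", "foreign", "currency",
         "invoice", "bill", "euro", "dollar", "payment", "receive"]
word_vecs = {w: rng.normal(size=50) for w in vocab}

def embed(text):
    """Average the vectors of in-vocabulary words; a crude sentence embedding."""
    vecs = [word_vecs[w] for w in text.lower().split() if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Offline: pre-compute one embedding per answer document.
answers = ["how to get paid in a foreign currency",
           "how to send an invoice to a customer"]
answer_vecs = [embed(a) for a in answers]

# Online: embed the query and take the nearest answer by cosine similarity.
query = "receive payment in euro"
scores = [cosine(embed(query), v) for v in answer_vecs]
print(answers[int(np.argmax(scores))])
```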
The point I'm trying to make is that when you look at the pros and cons, you have to look at them from an industrial perspective. We are not in an academic environment. In academia people might not worry as much about latency, because you have three months to get the results for your paper. I say that because everything that follows is in the light of those constraints. I'll give you a minute on this slide.

So let's look at the constraints we had. Like I said, we are an industrial ML team, not academia, and these are the requirements your partners, the product managers and engineering teams, will typically have. At least at Intuit, product managers want to understand everything. They might not understand the maths behind your models, but they need an intuitive sense of how they work; if you make it a complete black box, they will say they don't agree with it, because ultimately their neck is on the line if something goes wrong. So it doesn't matter how fancy your algorithm is, you have to be able to give at least some intuition about how it works. Second, the engineering teams might themselves have a very inefficient system, with a latency of, say, 300 milliseconds, and then they will tell you that your system needs a latency of 50 milliseconds, because it all adds up and the total cannot exceed 350 milliseconds; nobody notices that they already use 300 of those. I'm not trying to make fun of anyone, that is just the reality, so you need to be aware of the latency requirements. And the third, most important thing: big engineering teams do not want to change their systems, because changing systems means you end up breaking something. Most engineering teams want a plug-and-play solution; the best option for them is "give us an API, we'll call your API and get the job done, don't ask us to change anything in our system." So those were the three constraints we were working with.

Now let's look at the key idea, remembering that the main constraint was latency and the second one was that whatever we developed should be easy to plug into their existing system. What we did is train a skip-gram model, but with subwords as tokens. Training whole-word embeddings is a very standard thing people have been doing for three or four years now; in addition to that, we created character n-grams out of words. So if you take a word like "invoicing", the embedding of "invoicing" is not only the embedding of the whole word, but also a function of the embeddings of its constituent subwords.
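A minimal sketch of that composition, assuming simple averaging as the combination function (averaging is mentioned later in the Q&A as one variant tried; the exact weighting in the actual system may differ, and the toy vectors here are random stand-ins for trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with boundary markers as in fastText."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# Toy stand-ins for vectors learned by the subword skip-gram training.
word_vecs = {"invoice": rng.normal(size=DIM), "bill": rng.normal(size=DIM)}
ngram_vecs = {}
for w in word_vecs:
    for g in char_ngrams(w):
        ngram_vecs.setdefault(g, rng.normal(size=DIM))

def word_embedding(word):
    """Average the whole-word vector (if in vocabulary) with the known n-gram
    vectors. For an out-of-vocabulary or misspelled word, only the n-gram
    part is available, so the vector is built purely from shared chunks."""
    parts = [ngram_vecs[g] for g in char_ngrams(word) if g in ngram_vecs]
    if word in word_vecs:
        parts.append(word_vecs[word])
    return np.mean(parts, axis=0) if parts else np.zeros(DIM)

print(word_embedding("invoice").shape)  # in-vocabulary word
print(word_embedding("invoce").shape)   # misspelled, built from shared n-grams
```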
So think of the embedding of a word as a function of the whole-word embedding and some function of the constituent character n-gram embeddings: 2-grams, 3-grams, 4-grams and so on. That is what allows you to create synonyms. For example, take "e-invoice" and "invoice": in terms of subwords, most of the 2-grams and 3-grams are common; except for the first two or three 2-grams, the rest are shared between "e-invoice" and the word "invoice". So naturally these two are going to be close together in the embedding space if you create embeddings this way. And why would "bill" also be close to "invoice"? Because of the whole-word embeddings: "bill" and "invoice" are used interchangeably, they occur in similar contexts, so even without subword embeddings they would be close together, just from the basic property of whole-word embeddings.

What we did is create a list of synonyms for every word. If some of you know Elasticsearch or Solr, these tools provide a mechanism to inject synonyms through a synonyms file, and this is the format of that file. Essentially we are creating equivalence classes: we are saying that "e-invoice", "invoice" and "bill" are equivalent to each other and map to the canonical form, which is "invoice". Now if there is a query, say an AND query on "send" and "e-invoice", it will be expanded to include the other words in the equivalence class as well, so you will retrieve not only documents containing "e-invoice" but also documents containing "invoice" and "bill".

Then, coming to how we handle out-of-vocabulary words: for an out-of-vocabulary word, say a misspelled version of "invoice", you break it up into its character n-grams. You know the embeddings for those character n-grams, because you trained your model earlier, so you retrieve them and compute the embedding of the whole word as a function of the character n-grams. Obviously this word will not have a whole-word embedding, because it is out of vocabulary, but it will still have embeddings for its constituent character n-grams, or chunks. So when you see an out-of-vocabulary word, you retrieve the embeddings of its character n-grams, compute the embedding of the misspelled or out-of-vocabulary word, and then do a nearest-neighbor lookup on the fly to retrieve its neighbors. This process is still much faster than pushing things through an RNN model, because an approximate nearest-neighbor lookup can be quite fast. And this is an example of how it works: here is the correct word and here is the incorrect word with a misspelling in it, and if you look at the 2-grams, most of them are common between the two words, except for two or three, depending on where the misspelling is.
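As a rough sketch of the offline step, here is one way such a synonyms file could be generated from trained embeddings. The function name, the threshold, the canonical terms and the toy vectors are illustrative assumptions; the output lines do follow the standard Solr/Elasticsearch explicit-mapping synonym format ("a, b => canonical").

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def write_synonyms(vectors, canonical_terms, path="synonyms.txt", threshold=0.8):
    """vectors: {word: np.ndarray}. For each canonical term, group vocabulary
    words whose embeddings are close enough into one equivalence class and
    emit a Solr/Elasticsearch explicit-mapping line for it."""
    with open(path, "w") as out:
        for canon in canonical_terms:
            ref = vectors[canon]
            members = [w for w in vectors
                       if w != canon and cosine(vectors[w], ref) >= threshold]
            if members:
                # e.g. "e-invoice, bill => invoice"
                out.write(f"{', '.join(members)} => {canon}\n")

# Toy usage with random stand-in vectors (real ones come from the trained model);
# the threshold is dropped to -1.0 only so the demo writes something.
rng = np.random.default_rng(0)
toy = {w: rng.normal(size=50) for w in ["invoice", "e-invoice", "bill", "payroll"]}
write_synonyms(toy, canonical_terms=["invoice"], threshold=-1.0)
```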
So the net result is that we got a latency of typically less than 100 milliseconds; 100 milliseconds was roughly the P98, so we exceed it in only about 2% of cases. When you compare that with deep learning models, where you typically see latencies of at least 200 milliseconds, that is a clear win in terms of speed.

Now, the earlier speaker talked about pitfalls in ML, and overfitting is a very common problem in deep learning; this slide is just trying to show that. We had a model with close to 100,000 parameters. It is well known that the VC dimension of a neural network is lower-bounded by roughly the number of parameters, and in most cases it can be much higher, so order-n is just the lower bound. In our case we found that 50,000 examples was clearly not enough, and even 500,000 was not enough; the model was still overfitting at that point. And remember, this is not a very complicated model, it is just a basic model where we are training embeddings. Only when we went up to a few million examples, somewhere around one or two million, did it stop overfitting. So this slide is just to illustrate that neural network models do overfit, and this is one of the less complicated ones; if you use RNNs and the like, you will probably need well over a million examples. I mention it because I got to see this phenomenon in practice, in addition to reading about it in theory.

Now let's look at the data set and the ground truth, because all of this is fine, there is a way to handle out-of-vocabulary words, but how do we verify that it works? We felt that any test or data set we use has to satisfy three properties. First, the ground-truth answers must be known, because we need the absolute ground truth; we did not want to rely on secondary feedback like clicks or browsing, which is a weaker notion of feedback. Second, the query should be related to the answer, which is obvious, but I'm putting it there. And third, there should not be direct word overlap between the queries and the documents, because if there were, there would be no need for semantic search; basic keyword search would retrieve the answers. The point is, it is not easy to get a data set matching all these criteria. So we thought about it a bit and looked at books. All of us have read technical books in college, and a typical page of a book has a section heading and some text underneath it: the heading and the body. And books are written by authors who are experts in their field.
The authors take care to ensure that the heading is a summary of the content, because obviously you don't want the heading to be irrelevant to the content underneath it. So headings are typically related to the content, or succinctly summarize what is below them. But because they are very short summaries, headings typically don't go beyond four or five words, and so there is not much direct word overlap. For example, the heading could be "tax exemptions" and the actual text might say "the following expenses are deductible", and so on; the word "exemption" might not even occur in the text, because the author is summarizing the content in the heading. We realized that books satisfy all three properties I mentioned earlier. The ground-truth property is satisfied because these are expert authors; they clearly know the heading is relevant to the text underneath it. The second property is satisfied because the heading is clearly related to the content under it. And because headings are very short, there is typically not much word overlap between the heading and the text.

So the problem now becomes: given the heading, can you retrieve the matching body? That is the first problem. The second problem is: given a perturbed version of the heading, where you make some mistakes while typing it, can you still retrieve the correct answer? We created a set of around 800 questions along with the ground-truth answers. Then we had to generate misspellings, because we also wanted to test whether you can retrieve the correct answer in the presence of misspellings. What we did is identify a set of 200 keywords important for our domain, in consultation with the business teams, who are the experts; we even consulted chartered accountants, since we are in the accounting domain. Then we asked users to type sentences containing these keywords at a pace of 33 words per minute. That is slightly above the average typing speed; I think I type at 20 or so. It is not an unreasonably high speed, people can type at it, but it is not a very comfortable speed either, and that matters: if you type one sentence in an hour, you will obviously not make any mistakes. We wanted a realistic scenario from which to gather statistics on the kinds of mistakes people make. Around 20 people typed these sentences, so overall we had about 1,000 sentences, and for each keyword we then created a distribution over its typed variants. Different people misspell words in different ways, so this is a way of naturally collecting statistics on how each word gets misspelled. With that distribution over the misspelled variants of each word in hand, we could generate misspelled versions of sentences.
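A small sketch of that statistics-gathering step, assuming you have (intended keyword, typed token) pairs harvested from the typing exercise; the pairs below and the variable names are illustrative, and a Counter is just one convenient way to turn counts into a sampling distribution:

```python
from collections import Counter, defaultdict

# Hypothetical (intended keyword, what the user actually typed) pairs.
observations = [
    ("invoice", "invoice"), ("invoice", "invocie"), ("invoice", "invoice"),
    ("euro", "euro"), ("euro", "ezro"), ("euro", "eur"),
]

# D_w: for each keyword w, a distribution over its observed typed variants.
variant_counts = defaultdict(Counter)
for intended, typed in observations:
    variant_counts[intended][typed] += 1

variant_dist = {
    w: {v: c / sum(counts.values()) for v, c in counts.items()}
    for w, counts in variant_counts.items()
}
print(variant_dist["euro"])  # e.g. {'euro': 0.33, 'ezro': 0.33, 'eur': 0.33}
```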
For each sentence, we toss a biased coin that lands heads with probability 0.3. If it lands heads, we choose one word from the sentence to perturb, chosen uniformly at random, and then we draw a sample from that word's variant distribution, D_w. So that is the generative process for creating misspelled, or aberrated, sentences.

Then we set up a simulated A/B test. At this point we had around 800 queries, we knew the ground-truth answer for each of them, and we also had perturbed versions of those 800 queries. We created two Elasticsearch instances. The instance we called control was built with a synonyms file generated from basic word2vec embeddings; the treatment was built using our method. We send the same 800 questions to both instances. The ideal performance would be getting the right answer for all 800, and obviously neither system is ideal, so we compare: out of 800 queries, for how many do we retrieve the correct answer? That is what this graph shows, and the x-axis indicates how many top answers we look at. If we only look at the top 10 answers, the blue bar is the control and the green bar is the treatment: with the baseline, sending 800 perturbed queries, we get the correct answer for only 436 of them, whereas with the instance using our approach we get it for 526, which is a lift of roughly 20% in the number of answers retrieved. We can be very sure about these numbers because we know the ground truth exactly: I know for certain which answer is the ground truth for each query, so if I don't retrieve it, I count that query as not answered. You see similar trends if you look at the top 30 or the top 100 answers. What is more interesting is that when we went from 30 to 100, the green bar had still not saturated. That means there are correct answers ranked somewhere between 30 and 100 which we are still able to retrieve; going beyond the top 30 is still helping us. Whereas for the blue bar it does not help, probably because the answer is not retrieved at all: even if you go down to the 99th-ranked result, the correct answer is simply not there. So that is another important takeaway from this slide: the treatment does not saturate quickly. And it also motivates the next step, which is that we need a way to re-rank things, because the basic ranking that Elasticsearch or Solr provides is not working perfectly well for our case.
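A minimal sketch of that generative perturbation process. The 0.3 probability and the "one word, chosen uniformly at random" rule are from the talk; the variant distribution, the seed, and the example sentence are illustrative assumptions.

```python
import random

random.seed(0)
PERTURB_PROB = 0.3  # probability that a given sentence gets one perturbed word

# Hypothetical D_w: per-keyword distribution over observed typed variants.
variant_dist = {
    "invoice": {"invoice": 0.7, "invocie": 0.3},
    "euro": {"euro": 0.5, "ezro": 0.3, "eur": 0.2},
}

def perturb_sentence(sentence):
    """With probability 0.3, pick one word uniformly at random and replace it
    with a variant sampled from its distribution D_w (if we have one)."""
    words = sentence.split()
    if random.random() >= PERTURB_PROB:
        return sentence
    i = random.randrange(len(words))
    dist = variant_dist.get(words[i].lower())
    if dist:
        variants, probs = zip(*dist.items())
        words[i] = random.choices(variants, weights=probs, k=1)[0]
    return " ".join(words)

print(perturb_sentence("how to send an invoice in euro"))
```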
So in addition to this model, we need another downstream model that can re-rank the answers in the right order. But the purpose of this talk was to focus on recall: can you retrieve the answer at all? Whether it comes back ranked 50th or 60th is a second-level question, because if you are not even able to retrieve it, ranking is not going to help you.

A bit of intuition on how this compares with other approaches, such as edit distance. We did not run extensive experiments here, but intuitively the embedding-based approach seems superior, because it has a notion of semantics, including sub-word semantics. The easiest way to see it is with an example: "learnings" and "earnings", where the input is obviously a misspelling. The edit distance between the two is very small, so an edit-distance approach might just zero in on "earnings", which is a completely different word with no relation to the intended one. The problem with edit distance is that it has no notion of sub-word similarity. Similarly, "arbitrate" and "barbiturate": a barbiturate is a drug, if some of you are interested, while arbitration is a legal term, where a third party sits in judgment and tries to resolve a disagreement between two other parties. Again the edit distance between the two is very small, so with pure edit distance you tend to zero in on these spurious synonyms that are really not synonyms at all. Similar things can happen with phonetic similarity, and another challenge there is that you need a lot of handcrafted rules, because different languages have different morphology. In our case we do not have the luxury of handcrafted rules, because our language is quite dynamic and we want to adapt to what customers do.

This slide gives some tips for working with business; just read through it. The key message is that you will always have multiple stakeholders: your superiors, who are VPs or directors and are business people, product managers, and other data scientists. Each stakeholder has a different set of requirements, and they conflict, so don't try to please everyone, just try to do the right thing. You will get these kinds of questions, and it is important to have an answer, because it is fair for these people to ask them. But don't get too bogged down either. Your VP might say, hey, word2vec was done three years back, what's so great about what you're doing now? Maybe they don't realize we are doing something different. They read something in a business magazine, or attend some five-minute lecture that Andrew might have given, and then they come and ask why we aren't using the fanciest technique. So just be careful about those things.
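A quick aside to make the edit-distance comparison above concrete. This is a standard Levenshtein-distance implementation, not the system's code: "learnings" is one edit away from both the unrelated word "earnings" and the intended word "learning", so the distance alone cannot tell them apart.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Both candidates are one edit away from the query.
print(levenshtein("learnings", "earnings"))  # 1
print(levenshtein("learnings", "learning"))  # 1
```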
And then, from a data science perspective, what we learned is that, given the current state of chatbots, people are smart: they quickly understand whether they are talking to a person or to a bot, and interesting things happen. We have seen people ask questions like "are you smart?" when they know they are talking to a chatbot, a question you would not ask a human care agent unless you were completely angry with them. People also invent words. The "Neymar act" is a famous one from the recent World Cup: it has nothing to do with football, it just means throwing a tantrum or putting on some drama. We need to keep up with these things, which again is why we cannot rely on handcrafted rules; phonetics-based approaches built on handcrafted rules clearly cannot keep up with the pace at which language changes.

And finally, we realized we need to be humble, because there is a whole category of queries that cannot be handled with current ML techniques. This is an example: "I'm an Indian citizen who lived in Bangalore from this period, I'm now working in the US on an L1 visa, I travelled to Germany for one month, my brother has a small enterprise and I'm a 20% partner; am I liable to pay income tax in India?" To answer this, the question is spread across multiple domains: you need to understand immigration law, the taxation laws of both the US and India, and whether one month in Germany changes anything, or whether three months would have. We are clearly not able to answer these kinds of questions, because it is not possible to have a ready-made answer for something this contextual and specific. ML is not at a stage where you can directly answer such questions, and that is where the humility comes in: a lot more research is needed. These are the questions we categorize as "the answer does not exist"; the answer has to be created by your model. Thank you. If you have any questions, I'm available.

An audience member asks: isn't the key idea here very similar to fastText? It is similar to fastText, yes. And the second part of the question: when you add character bigrams, don't you lose some semantics compared to word2vec or GloVe? What we do is make the embedding of a word a function of the bigram or trigram embeddings, some n-gram embeddings, as well as the whole-word embedding. The whole-word embedding is also there, so it is not only a function of the n-grams. A follow-up: did you explore trigrams instead of just bigrams? Yes; bigrams were just an example. We explored different n-gram sizes, ranges like 2 to 6 and 2 to 10, and I think 3 to 6 worked the best. The reason is that if you go too fine and use single letters, you have a lot of support but it doesn't tell you anything; on the other hand, if your minimum n-gram size is 5, you probably already lose out on a lot of words. So the sweet spot was somewhere between 3 and 6. Okay, I don't know who is next. Yeah, please.
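Since the fastText comparison came up: the same subword skip-gram idea can be reproduced with gensim's FastText implementation, where min_n and max_n correspond to the 3-to-6 n-gram range discussed above. This is a sketch on a toy corpus, not the system described in the talk; the corpus and hyperparameters are illustrative.

```python
from gensim.models import FastText

# Toy corpus; a real corpus would be the company's chat / knowledge-base text.
sentences = [
    ["how", "to", "send", "an", "invoice", "to", "a", "customer"],
    ["how", "to", "get", "paid", "in", "a", "foreign", "currency"],
    ["create", "a", "bill", "for", "a", "customer"],
] * 50  # repeat so the toy vocabulary has some support

model = FastText(
    sentences,
    vector_size=50,    # embedding dimension
    sg=1,              # skip-gram, as in the talk
    min_n=3, max_n=6,  # character n-gram range that worked best (3 to 6)
    min_count=1,
    epochs=10,
)

# Out-of-vocabulary / misspelled words still get a vector from their n-grams.
print(model.wv.most_similar("invocie", topn=3))
```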
The next question: you said word embeddings were used, but for misspelled words you talked about character-level embeddings; if the entire model is learned on the basis of word embeddings, how do you use the character embeddings? Sure. The way the model is built, it uses both word embeddings and character embeddings: the output of training is an embedding for each word in the vocabulary as well as embeddings for the n-grams. Take a simple, concrete case and focus only on 2-gram embeddings. I split each word into its constituent n-grams and say that the embedding of the word is a function of the whole-word embedding as well as the character embeddings. So for "invoice" there will be a separate embedding for the n-gram "in", and you already have those embeddings for "in", "nv", "vo" and so on, created as part of the model-building process.

A follow-up: so the function takes the word-level embedding and the character-level embeddings as input, and the final vector is what gets used; in the case of a misspelled word, did it actually work in terms of cosine similarity, or whatever similarity you use, between the character-level embedding of the misspelled input and the intended word? Yes, it did; that is why we are able to retrieve things. What function was used to merge the two embeddings? We tried different ways. One was simply averaging; another, for out-of-vocabulary words, where there is clearly no embedding for the entire word, was to give more weight to the character-level embeddings. Those are the kinds of functions we used. On the specific question about cosine similarity, I don't have exact numbers, but I think it was above 0.8 or so, and in practice we did see that the synonyms made sense. In fact, that is why you see so many extra answers being retrieved in practice: if there were no real similarity, you wouldn't see a difference of 20% in the number of extra answers.

Next question: hello, myself Rakhar, I'm working at Nikky.ai. One of the things you mentioned is handling out-of-vocabulary words, and one of the slides had the example of learning and earning. Suppose "earning" was in the vocabulary but "learning" was not. The way you would build an embedding for "learning" is with the character-level skip-gram model you have already trained, and "earning" is the common part of it; only the "l" is missing when you build the full embedding for "learning". You said you intuitively saw that Levenshtein distance wouldn't work there, but only the single letter "l" differs and the rest is common, so doesn't the "earning" embedding end up similar to "learning" for the same reason, with Levenshtein giving you the same result? Well, the point is that it depends on which n-grams you have.
In terms of edit distance, yes, the difference is probably only one. But if you go to the word-level embeddings, you might actually have the words "earn" and "learn" in your vocabulary even if you don't have "learning". To give a concrete example, say you also have 4-grams: if "earn" and "learn" are in the vocabulary, those sub-word-level semantics are going to influence how close the final embeddings are, because you get some information from the 4-grams, some from the 2-grams, the 3-grams, the 5-grams. And "learn" and "earn" are obviously not close; they don't have any relationship. In fact, if you learn more, sometimes you earn less. So those kinds of signals take care of separating things out; I'm speaking intuitively here. Whereas with edit distance there is no notion of sub-word similarity, or dissimilarity.

Another question: in the out-of-vocabulary case, for a given word, how do you decide whether it is actually a new word or a misspelling of an existing word? We don't have any way of differentiating that, at least as of now. One way would be to use an expanded dictionary and so on, but for now we don't really distinguish: we check whether the word exists in the vocabulary, and if it does not, we try to create its embedding and map it to the closest word. So in the current approach we don't make a distinction between a misspelled word and a word that is not in our vocabulary but might still be in the general English vocabulary. The questioner suggests that one way to solve this would be to apply some pruning, for example with edit distance, or to use the word's context itself. Yes, you could do that; there are different ways. But in this work we specifically wanted to understand how sub-word embeddings behave, so it is a very focused piece of work exploring sub-word embeddings, and the second motivation was to do it in a data-driven way. Combining edit distance with this approach and so on is something you would do in practice, but for this talk I wanted to focus on sub-word embeddings, because it is a slightly new way of doing things; edit distance has been around for a long time. You can always combine different approaches, but you see what I'm trying to say: this is a different way of doing it.

One more question, coming back to learning and earning: even with a misspelling, based on the context you should be able to work out whether the person means learning or earning. Yes; again, it depends on how complicated you want your models to be. If you want to use a full-blown RNN-style model at query time, you can do all of that, because that model will capture the context; you might even feed in the earlier utterances of the chat. There are lots of things you could do, but you have to understand that the trade-off is between latency and accuracy, and the reason we did not include the entire context is that we did not want to complicate things at this stage.
If you use an RNN model, you can push in the entire chat history up to that point in the session, and that gives you even more context, maybe enough context to correct the word: if you know the conversation is about admission to IISc or something like that, most likely the person is talking about learning and not earning, whereas if it is about IIM, they are probably talking about earning and not learning. So yeah.