conversation. So this is a crash course on slot filling — that's the fun of natural language processing. Conversational assistants are pretty hot these days; everybody wants to build one. And you can build one to do a variety of things: interactive discovery or search; question answering — factual questions like "who was the first president of the United States?"; or just small talk — "hi", "hello", "are you a human?", "I'm feeling bad", stuff like that. The thing I'm going to talk about in particular is getting tasks done — a variety of tasks, from recharging your phone to ordering food, ordering mobile phones, booking flights. There are pros and cons of using conversation to get tasks done; the idea is to minimize the user's effort, the number of clicks it takes to finish these tasks.

Okay, so building a bot that does tasks for you is not a trivial job. There are several layers. At the top layer is the dialogue, where you interact with the user to figure out what they intend to do, and there are two parts to it: one is spoken language understanding — the user said something, what does it mean? — and the second is responding to that. Once you have figured out the expected slot values for the task — say, to recharge a phone — you create an order: you connect to the recharge provider and set it up. With the order set up, you ask the user to pay for it. And finally, after the user has paid, there's order processing and delivery — if you ordered food, the food should be delivered to you. So AI, or natural language processing, or dialogue, is just a small part of setting up a business of conversational assistants that get tasks done. And since we have 15 minutes, I'll focus on one small part of the dialogue — spoken language understanding — and a particular way to do it, and I'll show how deep learning gets involved there.

Yeah, so in running the dialogue there are multiple challenges, and understanding what the user said — what the sentence means — is the crucial one. Often users want to small-talk a little before getting down to business, there can be topic changes, and users can stray from the main topic of the conversation. And there are a bunch of design decisions — how much UX you put in, how much UI, how much chat — that have to be fine-tuned for the best experience.

Okay, so again, I'm going to focus on spoken language understanding. What does it mean? In general it's a very complex task — for arbitrary sentences it's very hard, and there's a huge amount of research on it. For chat, where users type normally short sentences, the task can be narrowed down. You're basically looking for three things. When you get a sentence like "recharge this number for Rs. 100": the domain, which is mobile; the intent, which is recharge; and things called slots, which you want to fill — those expected values. For a recharge, you want the user to at least tell you the phone number and the amount; that's the minimum required. Another example: "show me flights from BLR to Delhi on 4th July". The domain is travel or flights; the intent is search (there could be other intents, like booking a flight); and there are three slots the user gives you: location.from, which is BLR; location.to, which is Delhi; and the date of departure.
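To make that concrete, here is a minimal sketch of what an SLU result for the flight query might look like. The structure and field names are made up for illustration, not from any real system:

```python
# Hypothetical shape of an SLU result for
# "show me flights from BLR to Delhi on 4th July".
# Field names are illustrative, not a real API.
slu_result = {
    "domain": "flights",
    "intent": "search",
    "slots": {
        "location.from": "BLR",
        "location.to": "Delhi",
        "date.depart": "4th July",
        # Slots the user hasn't provided (passengers, time, ...) stay
        # unfilled; the dialogue asks follow-up questions to fill them.
    },
}
```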
Note that the user hasn't told you how many people are traveling, food preferences, what time they want to fly, stuff like that. So there's a bunch of slots provided in the first sentence, and you have to interact with the user to figure out the rest — that's what the conversation is all about. And we'll be focusing on how to solve this problem using deep learning.

Okay, so filling in these slots is called the slot filling problem — a very popular, well-researched problem in the speech community. And interestingly, for the narrow context we have — user chats, with a limited vocabulary and short sentences — this particular problem suffices in many cases to understand what the user means. The more general problem is semantic parsing, which is pretty complex; this is an instance of it.

Okay, so what is slot filling, and how do you formulate it as a mathematical problem? Essentially, it's an instance of sequence labeling. You have a sequence, processed left to right, and every item gets a label. Irrelevant words get the "outside" label; in the recharge example, the number gets the phone-number label and the amount gets the amount label. In the flights example, BLR gets location.from, Delhi gets location.to, and "4th" and "July" both get the departure-date label. So essentially you take the sentence as a sequence, label it, and extract the labels and their corresponding values — here the two date tokens carry contiguous identical labels, so you extract them together. I'm simplifying a lot of stuff here, but just to give the intuition. People who are familiar with part-of-speech tagging or named entity recognition have seen variants of this problem.

Okay, a little bit of math before you cringe in your seat — it's very simple. There's a sequence of outputs y, which are the labels, and inputs x, which is the sentence. You can define a probability over output label sequences given a particular x, P(y | x), and factorize it: at each position you predict the probability of that output label given the inputs seen so far, P(y | x) = P(y_0 | x_0) · P(y_1 | x_0, x_1) · ... · P(y_T | x_0, ..., x_T). The goal is to find the output sequence with maximum probability: among all label sequences for the same input, you want the one with the highest probability. So in this picture, x_0, ..., x_T is the input and y_0, ..., y_T are the labels. That's the simple math, and I'm going to use a little more of it to help you understand how RNNs, which are the primary tool here, work.

Okay, so traditionally — before deep learning was applied to this problem — conditional random fields were the popular choice. They maximized this probability using manual features, mostly syntactic: uppercase/lowercase, POS tags, and so on. Over roughly the last four or five years, RNNs have taken over; as in other fields, they perform much better than CRFs because they learn the features automatically. The idea is to summarize prefixes of the sentence using a hidden state that keeps track of the context. I don't know how many of you have seen this, but if you have studied circuits: basic feedforward nets are like acyclic circuits — they don't have loops — while RNNs have loops, so they're like sequential circuits. You can unroll the loop and get a chain of these boxes, and if you've read about LSTMs and other RNN variants, you can make each box very complex in order to track the history properly.
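As a concrete picture of the sequence-labeling view, here is a small sketch of labeling and slot extraction. The label names are illustrative (real systems often use BIO-style tags, which I'm glossing over here):

```python
# Sequence-labeling view of slot filling; label names are illustrative.
tokens = ["show", "me", "flights", "from", "BLR", "to", "Delhi", "on", "4th", "July"]
labels = ["O", "O", "O", "O", "location.from", "O", "location.to", "O",
          "date.depart", "date.depart"]

def extract_slots(tokens, labels):
    """Merge contiguous tokens that share a slot label into one value."""
    slots, prev = {}, None
    for tok, lab in zip(tokens, labels):
        if lab == "O":               # "outside": not part of any slot
            prev = None
            continue
        slots[lab] = slots[lab] + " " + tok if lab == prev else tok
        prev = lab
    return slots

print(extract_slots(tokens, labels))
# -> {'location.from': 'BLR', 'location.to': 'Delhi', 'date.depart': '4th July'}
```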
So, modulo that box, the arithmetic of RNNs is pretty simple. Given the current hidden state h_t and the current input x_t, you take a linear combination of them and apply a function to get the next hidden state: h_{t+1} = f(V·h_t + U·x_t). It's a recurrent state computation: you have a hidden state, you get the next input, you update it, get the new state. And h_t summarizes the prefix of x's seen so far — that's all that's going on. These V's and U's can get arbitrarily complicated; they can use the x_t's and h_t's again.

All right, so RNNs are the machinery we're going to use to solve the sequence labeling problem. I'm going to go through a small number of varieties of RNN to explain how hard the problem is, and what you have to add to the basic RNN to solve it with greater accuracy.

Okay, so I just showed you this, and I'm going to use these dependency diagrams. x_0 feeds into h_0, which generates the output label — so this could be "Delhi", and you want location.to. As you go to the right, the h_t's capture the history of the x's and generate y_t; so y_t is generated by looking at the whole history. And y_t is also a function — if you're familiar with sigmoid and softmax: you apply a sigmoid-style nonlinearity to get the next state, and a softmax over the output classification vector to select one label. These are basic Elman RNNs, where y_t depends on the previous x's through the h_t's. You don't look into the future, and you don't know what the previous output label was — and that can lead to problems.

The same example shows the kind of problem you can run into. Sentence: "flights from BLR to Delhi". BLR is correctly labeled location.from, but Delhi also gets location.from, which is wrong — it should be location.to. Why could this happen? Because the training set had Delhi as location.from in some particular example, and the RNN failed to learn that once you have seen location.from, the next location should be location.to, not location.from again. One reason it didn't learn that from the training sample is that we didn't help it learn properly: the model has no notion that the output labels are dependent on each other. Our mathematical formulation didn't support that dependency.

So now, Jordan RNNs. Essentially, we have to capture the dependencies among the outputs. If you're willing to pore over the math a bit: in the Elman RNN, h_{t+1} depends on h_t and x_t — the previous hidden state and the current input. In the Jordan RNN, it instead depends on y_t: h_{t+1} = f(V·y_t + U·x_t). I'll let you think about that a little: what happens if you make the next state depend on the previous y_t? That y_t is in turn dependent on the previous history, so you're still capturing the summary of the previous x's. But on top of that, you're also capturing the effect of W, the matrix that generates the output. If you had only h_t here, you wouldn't have captured the output matrix W in the state. So this is one particular way to capture the output dependence: the effect of W now feeds into the next state, where earlier it didn't.
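Here is a minimal numpy sketch of the two recurrences, just to make the difference concrete. The sizes, the initialization, and the choice of feeding the softmax probabilities back as y are illustrative assumptions, not a definitive implementation:

```python
import numpy as np

# Sketch of the Elman vs. Jordan recurrences (illustrative shapes/values).
rng = np.random.default_rng(0)
d_in, d_hid, n_labels = 8, 16, 5
U  = rng.normal(size=(d_hid, d_in))      # input  -> state
V  = rng.normal(size=(d_hid, d_hid))     # state  -> state (Elman)
Vy = rng.normal(size=(d_hid, n_labels))  # output -> state (Jordan)
W  = rng.normal(size=(n_labels, d_hid))  # state  -> output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def elman_step(h_prev, x):
    h = np.tanh(U @ x + V @ h_prev)   # next state sees the previous *state*
    return h, softmax(W @ h)

def jordan_step(y_prev, x):
    h = np.tanh(U @ x + Vy @ y_prev)  # next state sees the previous *output*,
    return h, softmax(W @ h)          # so W's effect feeds back into the state

xs = rng.normal(size=(4, d_in))       # a random "sentence" of 4 token vectors
h = np.zeros(d_hid)
y = np.full(n_labels, 1.0 / n_labels)
for x in xs:
    h, _ = elman_step(h, x)
    _, y = jordan_step(y, x)
```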
So now this should help you get the right labeling, location.to. Okay, so this is one way to improve the performance of the RNN, and in practice it does help.

There's another way. CRFs — this is a sort of diagrammatic representation of CRFs. There are x_t's again and y_t's again, obviously, but there's no hidden layer; hidden layers are brought in by neural networks, they're an artifact of neural networks. What you used to do was have features that combine x_t and y_t, and a correlation feature between the output labels: given a particular output y_t, what is the probability that the next label will be y_{t+1}? That's what this transition matrix captures. So this is a different way of capturing the output dependencies.

Now, can you combine this idea with the Jordan RNNs I just described? At first it doesn't seem to make sense: in the Jordan RNN you improve the network's ability to understand by adding the y's back on the input side — a sort of input dependency. But you can use the CRF idea on the other side, the output side, in the loss function. We haven't touched the loss function yet. In the basic RNN, the loss was a simple per-label loss: if you generate location.to where location.from was expected, you measure the probability difference at that position, and you add all of those differences up — that's the loss. You could instead improve the loss function by adding a new weight matrix A, which learns the correlations between output labels, and have it learn together with the y_t's in backpropagation. This is a different way of capturing the output dependency, compared to adding y_t to the next hidden state.

All right — I don't know how much time I have — so: CRFs, Elman RNNs, Jordan RNNs where you add the output dependency to the hidden state, Elman plus CRF: there's a variety of combinations you can build. Note that there's Jordan plus CRF, which has output dependencies both in the y_t's and in the loss function. And you can stare at this — it's already pretty good on this particular dataset. Now, there are differences between running RNNs on canned datasets like ATIS versus real-life user text: in ATIS the vocabulary is fixed and you don't have misspellings and so on. On that set CRFs are good, but just putting in an RNN — deep features instead of manual features — itself helps; adding output dependency helps further; and adding the CRF loss gives you a slight advantage on top, not much. In practice, if you're getting millions of chats and you want to be closer to 100%, these differences do matter.

Okay, so that's the gist. I tried to introduce the slot filling problem, which is at the core of understanding what a user's chat means to the computer, and its solution using recurrent neural networks — a bunch of architectures that are essentially about how you capture the output dependency. There are various tricks, and various upcoming papers with new tricks that try to capture it; these tricks matter a lot for accuracy in practice.
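To make the loss-function idea concrete, here is a minimal numpy sketch of a linear-chain CRF-style sequence loss on top of per-token scores. The names, and the assumption that the emission scores come straight from an RNN, are illustrative:

```python
import numpy as np

# emissions[t, j]: the network's score for label j at position t.
# A[i, j]: learned transition score from label i to label j (the matrix "A").
def sequence_score(emissions, A, labels):
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        score += A[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return score

def log_partition(emissions, A):
    # Forward algorithm: log-sum-exp over all possible label sequences.
    alpha = emissions[0]
    for t in range(1, emissions.shape[0]):
        scores = alpha[:, None] + A          # scores[i, j] = alpha[i] + A[i, j]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + emissions[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def crf_neg_log_likelihood(emissions, A, labels):
    # Sequence-level loss, minimized jointly with the RNN weights by backprop.
    return log_partition(emissions, A) - sequence_score(emissions, A, labels)
```

In training, the emissions would come from the RNN and gradients would flow through both A and the RNN weights; at prediction time you would decode with Viterbi rather than a per-position argmax.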
I didn't talk about a number of other architectures you could use. One is bidirectional RNNs — this is the diagram here. In the previous formulations, the hidden state h_t depends only on the past; it doesn't look at the future. How do you make it dependent on the future? You have another parallel network running from the right-hand side, and at every point you combine the two — a linear combination, or you can learn the weight matrices here — to generate an output. That helps you capture both the backward and the forward input dependency. In practice there are some cases where it helps, but mostly we have observed that the forward direction, capturing the past history, is sufficient. That's one architecture that can buy you a few more decimals.

And you could also do windows: instead of considering one input at a time, you consider triplets of inputs, sort of like n-grams, and at every step you feed in these overlapping triplets, overlapping windows of input. That gains you a few more points.

Yeah, so that's what I basically wanted to communicate: how hard the problem is and what it takes to solve it. And I'll leave you with this thought: if you think more about it, are there better architectures that could capture these output correlations even better? Thank you. That's it.

I think we have time for a couple of questions, and then Nishant will also be available in the birds-of-a-feather session.