So today's talk is going to be on sequence-to-sequence modeling using encoder-decoder neural networks. I'd like to start with a brief personal history, if you will. I did my PhD in 1991 at Rutgers University, and at the time I was working on neural networks. Surprise, surprise, right? That's about 28 years ago now. I would consider that the first phase of the current wave of neural networks; people had been working on perceptrons and such well before that, but in the 80s and 90s neural networks were a hot area, and I was doing my PhD in neural nets. Fast forward to about 1994, three years after my PhD: I was working at SRI International, previously known as Stanford Research Institute, in Menlo Park, California, in a lab that was well known for speech recognition. One of the things some of the researchers in our lab were doing was trying to use neural networks to do speech recognition. As you may know, the established technology for speech recognition had been hidden Markov models, particularly with Gaussian mixture models inside the hidden Markov model. What our colleagues at SRI were doing, in collaboration with people at ICSI at Berkeley and IDIAP in Switzerland, was to replace the Gaussian mixture model with a neural network. At that time, it didn't work very well. We got some interesting results, but it didn't beat the Gaussian mixture model system, because the neural network wasn't very powerful: it didn't have many layers, it didn't have many nodes, and we didn't have gobs of compute power to train on thousands or tens of thousands of utterances of speech. So it was a cool idea, but it really didn't work. Fast forward another 20 years, to 2015: I found myself at Google, working in the speech recognition group. By that time Google had already gotten deep neural networks working for speech recognition and had productized the system as well. In fact, deep neural networks made their first impact in the area of speech recognition before becoming useful for image recognition, and the rest, as people say, is history. But the interesting thing to keep in mind is that the ideas being explored in 1994 and 1995 by SRI, IDIAP and ICSI were identical to the techniques being used in 2015 by Google; Microsoft actually came up with the first industrial application, and Google followed. The basic idea was the same. The difference was that we could now train with tens of thousands of utterances, with neural networks of ten layers instead of two, and maybe several hundred nodes per layer instead of a few tens, and that made the difference. So neural networks have really taken the world by storm, as you all know, since around 2010, which is when they became a big hit in speech recognition; now neural networks are used in all kinds of areas. Natural language processing is one of them, and today I'll be using natural language processing as the example application area to explain this particular type of neural network model. My name is Anand Shankar. I work at the LinkedIn office in Mountain View, California.
I'm a principal member of the research staff there, and I work in the areas of natural language processing and multimedia understanding. So today's talk is going to be a pedagogical talk: I'll be teaching you about sequence-to-sequence learning with encoder-decoder models. If you already know about these models, then hopefully this will refresh your memory, or maybe you'll pick up some new ways of explaining how they work. I have very little math in my presentation. If you watched the talk this morning, which I think was one of the best talks I've attended (and if you haven't watched it, you should go watch the recording), I hope I've carried at least some of those ideas into this presentation. So with that, let's start with natural language processing with neural networks. A neural network is a black box that takes something as input and produces something else as output. So what is a natural language processing engine? It's something that takes a sequence of words, because natural language is a sequence of words, and does something with it. Let's say the sequence of words "mice eat cheese" is the input to this neural network. We can do several things with this sentence. One thing we could do is tag each word in the sentence: mice is a noun, eat is a verb, cheese is a noun, and maybe we want a neural network to tag each word correctly. That's one possible neural network model. Another model might take this input sentence and decide what language it's in. Is it Hindi? Is it English? Is it Tamil? Clearly it's English, so we would like the neural network to output "English" in this case. That's another application of a neural network for natural language processing. Yet another one: we may want to take the input sentence, which is in English, and translate it to German, "Mäuse essen Käse", which is "mice eat cheese" in German. That would be a translation task. And there are several such applications one might imagine a neural network could do. So I just wanted to give you some motivation: this is a black box, the black box can do several things, but for different types of applications this black box will look different. Today our black box is going to be an encoder-decoder model. So let's start with how we represent words. For a general encoder-decoder model you probably don't care about how you represent words, but because I'm taking natural language processing as the example application today, it's a good idea to start with this. So how do we represent words? Here we have three words: mice, eat, and cheese. And this is one way you can represent a word. It's called a one-hot encoding, where each component of these vectors corresponds to an item in a vocabulary. So if I have a vocabulary of 100,000 words, or 1,000 words, or whatever the vocabulary size is, each component in the vector corresponds to a specific word in the vocabulary. In this example my vocabulary contains only seven words: cat, cheese, money, mice, rats, eat, and horses. A neural network that operates on this vocabulary will not be able to do anything with any word except the words listed there.
So if "pizza" was entered into this neural network, it would say: too bad, I can't do anything with it, because I don't know what this word means. In this representation, the way we represent each word is by simply putting a one in the component of the vector corresponding to that word, and a zero everywhere else. So when you want to represent mice, you have a one in the component for mice and a zero everywhere else. For the word cheese, you have a one in the second position, because the second position corresponds to cheese, and so forth. Now, what's the problem with this representation? If I told you that mice eat cheese, and you know what a mouse looks like, and then I show you a rat, the two look alike, and you can pretty much guess that if this guy eats cheese, this one probably eats cheese too. You don't need to be told that a rat eats cheese; you can infer it just by looking at the two animals. On the other hand, if I showed you a horse, which looks very different from a mouse, you couldn't really infer that a horse eats cheese. Maybe it does, maybe horses do eat cheese, but you can't tell simply from the fact that a mouse eats cheese. You get the point: you can't infer this. Now, the one-hot encoding is a poor representation because there is no semantic power in it. The distance between the representations for mice and horses is the same as the distance between the representations for mice and rats. A rat has a one in one position, a horse has a one in another position, but the distance between any two of these vectors is the same: it's just a one in some position, so the Hamming distance is the same. So I cannot tell that a horse is different from a mouse; in this representation a horse is exactly as close to a mouse as a rat is, so you can't tell whether a horse is going to eat cheese or not. That's why this representation isn't very good. Instead, what people use is what's called a word embedding. A word embedding representation gives some sort of semantics to the word. Think of each component as representing some concept, like food, size, rodent, and so forth. A mouse is food, because a cat can eat a mouse. A mouse is a rodent; I keep saying mice, but singular or plural, it's a rodent. Its size is small, so I have a 0.1 for size, a 0.8 for rodent, a 0.2 for food because a mouse can be food for a cat, and so forth. Similarly, a horse is a very big animal, so its size is 0.9. It is a herbivore, so it's got a 0.8 there. It's not food for anything except perhaps a tiger, but I'm not considering tigers in my universe. So you can see the horse looks very different. A rat, of course, looks very much like a mouse: it's small, it's a rodent, and so forth. So now, if you know that mice eat cheese, and we ask whether rats eat cheese: well, a mouse and a rat look very similar to each other in this space, so yes, a rat is probably going to eat cheese. On the other hand, a horse and a mouse are very far apart in this space, and we cannot conclude that a horse eats cheese. So this is the idea of word embeddings.
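To make this concrete, here is a tiny NumPy sketch. The seven-word vocabulary is the one from the slide, but the hand-written embedding values are invented for illustration (a real model learns them); it also shows the matrix-times-one-hot lookup I'll come back to in a moment.

```python
import numpy as np

# Toy seven-word vocabulary; a word's index gives its one-hot position.
vocab = ["cat", "cheese", "money", "mice", "rats", "eat", "horses"]
idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[idx[word]] = 1.0
    return v

# Every pair of distinct one-hot vectors is equally far apart, so this
# representation cannot tell us that mice are more like rats than horses.
print(np.linalg.norm(one_hot("mice") - one_hot("rats")))    # ~1.414
print(np.linalg.norm(one_hot("mice") - one_hot("horses")))  # ~1.414

# Hand-written 3-d embeddings (rows: "size", "rodent", "food"), one column
# per word. These values are invented; learned dimensions carry no
# predefined meaning like this.
E = np.array([
    [0.2, 0.1, 0.5, 0.1, 0.15, 0.3, 0.9],   # size
    [0.0, 0.0, 0.0, 0.8, 0.8,  0.0, 0.0],   # rodent
    [0.3, 0.9, 0.0, 0.2, 0.2,  0.0, 0.0],   # food
])

# Multiplying E by a one-hot vector just picks out that word's column:
print(E @ one_hot("mice"))                   # [0.1, 0.8, 0.2]

# In embedding space, mice really are close to rats and far from horses.
def dist(a, b):
    return np.linalg.norm(E @ one_hot(a) - E @ one_hot(b))
print(dist("mice", "rats"), dist("mice", "horses"))   # small vs. large
```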
Word embeddings give semantic power and let you measure distances between words in a semantic space. These embeddings, of course, are learned automatically in a neural network. We don't actually label the components of a word embedding vector as herbivore, rodent, food, and so on; they're learned automatically. We just give the network, say, a 100-dimensional vector, and it automatically decides what values to give those 100 components and figures out what they might mean. That was just a way to explain it; in real life the dimensions do not have specific meanings like in this example. Now, how do we convert from a one-hot encoding to a word embedding? It's very simple. We have a matrix whose columns are the embeddings of the words in the vocabulary. We take a one-hot vector, for example for the word dog, and when we multiply this matrix by that vector, we just pick out the column holding the embedding for dog, and we get the representation of dog. The reason I'm telling you this is that in any such neural network, the input is typically the one-hot representation, and the first layer of the network is an embedding matrix, which gives you the embedding representation of that word. (Let's keep questions until the end, unless it's a pressing question about something you can't understand.) So each column here represents the embedding vector for a specific word. And like I said on the previous slide, these dimensions don't mean anything by themselves. In the example I gave you, the dimensions meant: is it a rodent, is it a herbivore, is it a small animal, and so forth. But the point is that in this four-dimensional space, if two vectors are close together, those two words probably mean similar things. So now let's move on to the recurrent neural network. This is a building block for what I'm going to talk about. We're going to use a recurrent neural network to do what we call language modeling: we want to predict the next word in a sentence given all the words seen before. So let's pretend we've just seen the word "dog" and we're trying to predict the next word, which is "chased". First we push "dog" through the embedding matrix E to produce the embedding vector, the green dots; I've already shown you how that works. Then we take this embedding vector and pass it to a recurrent neural network layer. The reason it's recurrent is that it has a connection from the output of the layer back to its input, the connection labeled W_H. W_H is the self-connection matrix: it gives the weights connecting the output of this layer back to itself. A normal feed-forward neural network would not have that connection, and that is the only difference between a recurrent neural network and a feed-forward one. The matrix W_E simply multiplies the embedding representation I already showed you and feeds it into this RNN. And finally, the red dots represent the hidden state of this RNN,
which gets fed back to itself and also fed to the next layer, where we use a softmax function to output a vector y_t. The vector y_t is trying to predict the next word. So I want y_t to predict "chased", because the input word was "dog"; it's a language model predicting the next word. What does y_t look like? y_t is simply a vector whose dimension is the number of vocabulary items, a one-hot-style word vector. We want y_t to have a high value in the position corresponding to the word "chased" and a low value everywhere else. I'm not going to dwell on the equations, but I've described how the network works in terms of the weights, the self-connection, and the softmax; I think you know these terms, so I won't belabor them. What I would like to say is that the picture on the left is one way of representing a recurrent neural network; think of it as a static representation. But what you will often see is a representation like the one on the right, called an unrolled representation of the recurrent neural network. All we've done is take the same structure and copy it as many times as there are words in the sentence. First we get the start-of-sentence symbol; it goes through the network and hopefully produces the word "the". Then we take "the" and produce "dog", and so forth. But the weights are shared: all the weights here are the same weights, copied in time when you unroll the network. It's just a way of representing it. If I had a longer sentence, I would unroll it more; if it was a short sentence, I would unroll it less. This is how recurrent neural networks are often represented. For each given word, we're trying to predict the next word: when the input is the start of sentence, we want "the"; when the input is "the", we want "dog"; and so forth. During training, we give this network a whole bunch of sentences and ask it to predict the next word. We show it "dog" and it has to produce "chased"; we give it "the" and it has to produce "cat". But remember, by the time it gets that second "the", it has already seen all the earlier words. You can see in the unrolled representation that the output at that position depends on all the inputs that came before it, because there's a connection going all the way from the earlier inputs, through the hidden layers, up to that output. So the output at each position depends on everything before it. These output vectors y are essentially word vectors in the one-hot representation; they are of the vocabulary size. Let's say I'm trying to produce the last word, "cat". We want the output of the neural network to have a one in that position. Maybe the network instead outputs a 0.4, or a 0.1, so there's an error between the 0.4 and the 1, and what we do in training is minimize this error, by minimizing the cross-entropy or, equivalently, maximizing the log-likelihood. The equation on the slide shows how we do that. All right.
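Here is a minimal NumPy sketch of the forward pass and the cross-entropy loss for this recurrent language model. The matrix names mirror the slides (embedding E, input weights W_e, self-connection W_h, output weights); the sizes, the random initialization, and the toy sentence are invented for illustration.

```python
import numpy as np
rng = np.random.default_rng(0)

V, D, H = 7, 4, 8                     # vocab, embedding, hidden sizes (made up)
E   = rng.normal(size=(D, V)) * 0.1   # embedding matrix (columns = embeddings)
W_e = rng.normal(size=(H, D)) * 0.1   # input weights
W_h = rng.normal(size=(H, H)) * 0.1   # recurrent (self-connection) weights
W_o = rng.normal(size=(V, H)) * 0.1   # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_forward(word_ids):
    """Run the unrolled RNN over a sentence; return one next-word
    distribution per time step."""
    h = np.zeros(H)
    ys = []
    for w in word_ids:
        x = E[:, w]                      # embedding lookup (one-hot times E)
        h = np.tanh(W_e @ x + W_h @ h)   # recurrent update of the hidden state
        ys.append(softmax(W_o @ h))      # distribution over the next word
    return ys

# Training criterion: cross-entropy between each prediction and the true
# next word (equivalently, the negative log-likelihood).
sentence = [0, 5, 1]                     # e.g. "cat eat cheese" as indices
ys = rnn_lm_forward(sentence[:-1])
loss = -sum(np.log(y[t]) for y, t in zip(ys, sentence[1:]))
print(loss)
```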
So another way of representing a recurrent neural network is this: remove all the dots, remove the weight labels, and simply draw it as boxes. This is probably the representation most people are familiar with. When I first saw it, I said, what the hell is going on? I couldn't really understand it at all. But really, what's happening is this: you have an embedding matrix, you have the input weight matrix, you have the self-connection matrix, and you have another matrix connecting the hidden layer to the output layer. All of those have to be in your head when you look at this sort of representation. And this other picture is simply a recurrent neural network with multiple layers, a deep recurrent neural network. If you add depth to the network, it turns out to have more expressive power, so people do that to get better results. So much for recurrent neural networks for language modeling. Another thing to keep in mind is that an unrolled recurrent neural network is really a feed-forward neural network, and I'm going to try to explain that here. I'm going to take the picture we had before, the one on the right, and redraw it in a different way, keeping the weight matrices. The first block is identical: you have h0 and x1 coming in, and h1 and y1 coming out; I'm just drawing it differently. The next step is to unroll it in time, so that's time step two, and we connect it this way. It's exactly the same picture, simply drawn in a different manner; it's a little longer because I decided to put the x over here instead of over there. And then time step three looks like this. Now this, all of us can recognize, is a feed-forward neural network. Maybe you could recognize it in the other drawing too, but it's not as obvious as in this picture. Everything flows from the bottom to the top, and this feed-forward network has three layers; each layer corresponds to a position in time in the unrolled recurrent neural network. So what does that mean? It means that any machinery you had for training feed-forward neural networks, you can use directly for recurrent neural networks as well. The way feed-forward neural networks are trained is using something called back propagation. We have multiple layers in a feed-forward network, with weights w1, w2, w3. Typically we do a forward computation, where we push the input through the network and the network produces an output y. The network also has some target it's trying to produce, t. We measure some distance between y and t, a loss function L(y, t), and we try to minimize that loss with respect to all the parameters of the neural network using gradient descent. And that is what the back propagation algorithm does.
It computes the gradient of L(y, t) with respect to all the parameters of the neural network, in a backward pass, using the algorithm called back propagation. I'm not going to go into that algorithm; it doesn't matter for the purpose of this talk. Just understand that all back propagation does is compute the derivative of L(y, t) with respect to all the weight matrices of this neural network: w1, w2, and w3. That's all it's doing. Now, we can apply this algorithm to a recurrent neural network, because I've already shown you that a recurrent neural network unrolled in time is exactly a feed-forward neural network. Therefore, I can apply back propagation to a recurrent neural network as well. The difference is that a recurrent neural network, when unrolled in time, is as long as the sequence. If you have a word sequence of length 10, it will be 10 units long; if it's 1,000, it'll be 1,000 units long. That's kind of weird, because a normal feed-forward neural network has a fixed depth. The good thing, of course, is that in a recurrent neural network all these weights are shared, and it turns out you can apply back propagation with shared weights; that part is actually fairly straightforward, but I won't go into it here. The point is that it is a feed-forward neural network, so back propagation applies. So in the forward pass, we start at the beginning and go all the way to the end, producing outputs and hidden states just like I showed you in the first few slides: we get an input, it produces a hidden representation and an output, that hidden representation goes to the next step, we get another input, and so forth. If I drew this as a feed-forward neural network, it would be going from bottom to top, like in my previous slide. Then we have a backward pass, which is back propagation, where we compute the derivative of the loss with respect to all the parameters of the network. But remember, this network has only three weight matrices; they are just copied in time. Now it turns out that doing this naively is not very good, because this network is very deep: if you convert it into a feed-forward neural network, it is a very deep one. And as you may know, when you have a very deep neural network and you apply back propagation to it, you run into the problem of vanishing or exploding gradients. I'm not going to go into the details of why that happens, but it happens; you can read about it, or we can talk about it offline. So it is not a good idea to train a very deep network with backprop, because of vanishing gradients. Instead, for recurrent neural networks, we do something called truncated back propagation through time: we do the forward propagation for only a few steps, and then we do the backward propagation for only that many steps. Then we continue the forward propagation, maintaining the history from the first three steps, and again do back propagation for only three steps; there's a small sketch of this below.
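Here is a minimal sketch of truncated back propagation through time, written in PyTorch for the autograd machinery; the model sizes, the fake token stream, and the chunk length of three are all invented for illustration. The key line is the detach: it keeps the hidden-state history as a value but cuts the gradient graph, so backward() never reaches more than three steps into the past.

```python
import torch
import torch.nn as nn

V, D, H = 100, 16, 32                       # made-up sizes
model = nn.ModuleDict({
    "emb": nn.Embedding(V, D),
    "rnn": nn.RNN(D, H, batch_first=True),
    "out": nn.Linear(H, V),
})
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, V, (1, 31))       # one long (fake) token sequence
k = 3                                       # truncation length: 3 steps at a time
h = torch.zeros(1, 1, H)                    # initial hidden state

for start in range(0, tokens.size(1) - 1 - k, k):
    x = tokens[:, start:start + k]          # k input words
    t = tokens[:, start + 1:start + k + 1]  # the k next words (targets)
    h = h.detach()                          # keep the history as a value, but cut
                                            # the graph so backprop stops here
    e = model["emb"](x)
    y, h = model["rnn"](e, h)               # forward propagate k steps
    loss = loss_fn(model["out"](y).reshape(-1, V), t.reshape(-1))
    opt.zero_grad()
    loss.backward()                         # gradients flow back at most k steps
    opt.step()
```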
So the back propagation is only done, in this case, for three steps at a time. In other words, we're only back-propagating through three layers of the corresponding feed-forward network, which takes care of the vanishing-gradient problem. So that's how recurrent neural networks are trained. Now that we have a handle on what a recurrent neural network looks like, let's see how we're going to use it in NLP. (By the way, when does the talk finish? 11:30, 20 minutes, okay.) These are the various ways in which we can use a recurrent neural network. One is language modeling, as I already said. That's called a many-to-many mapping, because you have many inputs, three words coming in, and many outputs, three words going out. Machine translation is also many-to-many: three words coming in and three words coming out, though in translation you may also have three words coming in and five words coming out. You can also have many-to-one, where you take a sentence and classify its language: there's only one output, namely which language it is. And one-to-many; an example is image description, where the input to the neural network is a picture, but the output is a sequence of words describing what the picture shows. In fact, there is a lot of work going on, and we are doing some of this at LinkedIn as well, on describing images using neural networks. The network can look at a picture and say, "this is a picture of mice eating cheese"; there are many papers that describe how to do this. One of them, I believe, is called Show, Attend and Tell. So that's a one-to-many application. Now, let's think about the translation application. Here I'm going to take a slightly more complicated example: "the black cat drank milk", which is five words, and the equivalent French, "le chat noir a bu du lait", which is seven words. Now, there's a problem. With the recurrent neural networks we've seen so far, it's not easy to produce seven words when five words come in: we had one word coming out for each input word, so it's many-to-many, but with the same sequence length. How do we handle this? We need a different sort of model, called the encoder-decoder model. In the encoder-decoder model, we use a recurrent neural network that looks slightly different: we don't care about an output from it. We just have the words coming in, each word being converted to a hidden representation, and the hidden representation going on to the next word, and so on; you've already seen what such a recurrent neural network looks like. So we have a recurrent neural network for the encoder part of the model, and in this case it takes "the black cat drank milk" and encodes it: it forms the hidden representations h1, h2, through h7, then applies some function f to h1 through h7 and produces one vector. We need to produce one fixed-dimensional vector to represent the whole sentence.
If a sentence is ten words long versus three words long, I can't have a representation whose size depends on the length, because you can't compare vectors that live in different spaces; we need them to be in the same dimensional space. So we take the hidden representations h1 through h7, which are each in a fixed-dimensional space, and convert them to a single hidden representation with some function f. The most common choice is simply to pick h7. That makes sense, right? Because by the time we've reached that point, the network has consumed all the earlier words, so presumably h7 holds a representation of not just "milk" but all the words before it as well. So that's what the encoder does: it takes the input and converts it to a representation. Then there's a decoder, which again is a recurrent neural network. The decoder is actually a standard language model: it takes an input, produces a word, that word comes in at the next step and produces another word, and so forth; exactly the sort of language model I already showed you. So we have these two components, and now we simply connect them: the language model, which is the French decoder, takes as input the representation the encoder produced. In other words, this is a conditional language model. It is not simply a language model spitting out French words in sequence; it is a language model spitting out French words conditioned on the representation it got from the encoder network. If I did not have that connection, this language model could be trained on a whole bunch of French data and it would produce French sentences, just random French sentences. With the connection, it tries to produce a French sentence that is similar in meaning to "the black cat drank milk". That is what the encoder-decoder model does: the encoder encodes the information in the sentence, the decoder produces a French sentence conditioned on that encoded representation, and you hope that it produces something like "le chat noir a bu du lait", as opposed to some random French sentence that doesn't mean "the black cat drank milk". Now, the way we train this: for machine translation we have pairs of English and French sentences, a whole bunch of pairs. We give these pairs to the model and ask it to minimize the cross-entropy between what we want it to produce and what it actually outputs, y1, y2, y3; it's essentially the formula I wrote earlier. One more detail: sometimes, when we are training the decoder, we don't feed it its own output from the previous step. We feed it the correct output from the previous step, taken from the training pair. This is called teacher forcing. In other words, we want it to produce "le chat noir a bu du lait", but maybe it produces the word "chien", which is dog, instead of "chat", cat. Normally we would take "chien", feed it in at the next step, and it would produce something else. But we found that if we actually feed in the correct word some fraction of the time during training, typically about 15% of the time, it produces better results. I'm just throwing this out so that you know what teacher forcing is. A minimal sketch of the whole encoder-decoder training step follows.
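Here is a minimal PyTorch sketch of one encoder-decoder training step with teacher forcing. Everything here is an illustrative assumption: the sizes, the fake token ids, and the choice of f as "take the encoder's last hidden state" (the pick-h7 option from the slides).

```python
import torch
import torch.nn as nn

SRC_V, TGT_V, D, H = 50, 60, 16, 32     # made-up vocabulary and layer sizes

enc_emb = nn.Embedding(SRC_V, D)
encoder = nn.RNN(D, H, batch_first=True)
dec_emb = nn.Embedding(TGT_V, D)
decoder = nn.RNN(D, H, batch_first=True)
project = nn.Linear(H, TGT_V)
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, SRC_V, (1, 5))   # "the black cat drank milk" (fake ids)
tgt = torch.randint(0, TGT_V, (1, 8))   # "<s> le chat noir a bu du lait" (fake ids)

# Encoder: consume the source; its last hidden state summarizes the sentence.
_, summary = encoder(enc_emb(src))      # summary plays the role of f(h1..h7) = h7

# Decoder: a language model conditioned on the summary. With teacher forcing
# we feed the *correct* previous target word at every step, not the model's
# own (possibly wrong, "chien") guess.
dec_in = tgt[:, :-1]                    # <s> le chat noir a bu du
dec_out, _ = decoder(dec_emb(dec_in), summary)
logits = project(dec_out)               # predictions: le chat noir a bu du lait

loss = loss_fn(logits.reshape(-1, TGT_V), tgt[:, 1:].reshape(-1))
loss.backward()                         # trained end to end with cross-entropy
```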
Okay. Now, during inference with an encoder-decoder model, pretend the network has already been trained. The encoder produces a representation, the decoder produces an output y1, y2 at each time step, and the output y2 is taken as the input for producing y3. So y1, y2, y3 are the words that had the maximum probability at each step: I'm taking the decoder's best guess at time 2 and giving it as input at time 3. This is called greedy decoding. We hope it produces the right answer, but as I'll show you on the next slide, things can go wrong. Suppose that instead of "le chat", the cat, I produced "le chien", the dog. Then I'm stuck: with greedy decoding I'm committed to that path, and I cannot produce any sentence except "le chien" something. It's going to be "the dog drank milk"; it can no longer be "the cat drank milk", because greedy decoding removed that option. So instead, people use beam search, where at every point in time we keep the best k possible outputs. Let's say the cat and the dog, "le chat" and "le chien", are both possibilities; we keep them both in our beam and expand both of them. "Le chat" becomes "le chat noir", there are also two paths continuing from "le chien", and at each point in time we keep only the best k, in this case k equal to 2, possibilities, and keep going. At some point we will get an end-of-sentence output, because the neural network can also output end-of-sentence; when it does, that hypothesis stops, because that's the end of the sentence. We keep going until we have some number of completed end-of-sentence hypotheses, maybe three in this example: one here, one here, and one here. Once I've got those three outputs, I stop, compute the score of each by normalizing its total score by its length, and then I can compare those scores and rank the hypotheses from top to bottom. And when you do that, even though I made a mistake at step two, I may be able to recover later in time, because I'm keeping these options open. A minimal sketch of this follows.
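Here is a minimal sketch of that beam search in plain Python. The step function (one decoder step returning candidate words with log-probabilities) and the toy scores are stand-ins for a real decoder; greedy decoding is just the special case k = 1, one path and no way to recover from choosing "le chien".

```python
def beam_search(step, k=2, eos=0, max_len=10, n_final=3):
    """step(prefix) -> list of (word_id, log_prob) candidates for the next
    word. Keep the best k partial hypotheses per time step; collect n_final
    finished hypotheses and rank them by length-normalized score."""
    def ranked(done):
        return sorted(done, key=lambda c: c[1] / len(c[0]), reverse=True)

    beams = [([], 0.0)]                   # (word prefix, total log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:       # expand every hypothesis in the beam
            for word, logp in step(prefix):
                candidates.append((prefix + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            if prefix[-1] == eos:         # end-of-sentence: hypothesis is done
                finished.append((prefix, score))
                if len(finished) == n_final:
                    return ranked(finished)
            elif len(beams) < k:          # otherwise keep only the best k
                beams.append((prefix, score))
        if not beams:
            break
    return ranked(finished)

# Toy usage: word 1 is always likely, word 0 is a less likely end-of-sentence.
# Length normalization lets the longer hypotheses win here.
print(beam_search(lambda prefix: [(1, -0.1), (0, -2.0)]))
```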
So what is the problem with encoder-decoder neural networks? The big problem is that the encoder representation doesn't depend on the decoder time step. When I'm producing the output at each point in time in the decoder, all I have is that single representation of the entire input sentence: "the black cat drank milk" is represented as one point in some n-dimensional space. It's kind of weird to imagine a sentence being represented as a point in a 100-dimensional space, but that's exactly what these networks do. Furthermore, it's odd that when I'm producing the word "bu" or "lait", I should be considering the whole of "the black cat". "Lait" means milk, so I should probably be considering mostly that word, and not the other words, when I produce it. When I translate and produce the word "lait", I'm certainly not thinking about the word cat in my head; I'm thinking of the word milk. But the model I just showed you doesn't do that: it has one representation of the entire sentence while producing every output word. Can we do better? Yes, we can, and the approach is attention. The attention model lets you do this in a much nicer way. Start from the encoder-decoder model we already had; I've just copied it. I'm going to show you how attention works by removing the part where we compute the single encoder representation and doing it slightly differently. Here's what we do. For each decoder hidden state s1, s2, s3 (let's look at just s1), I'm going to measure how close s1 is to the encoder representation at each time step. How close is s1 to h1? The scoring function s, which is often a dot product but can be other things, measures how close s1 is to h1; s1 and h1 are both vectors, hidden representations of the decoder and the encoder. Then: how close is s1 to h2? To h3? And so forth. These scores s1,1, s1,2, s1,3 are scalars that tell me how close each pair is. Now that I have these similarity scores, I push them through a softmax to turn them into probabilities, so they sum to 1. Let's say that in this case the word "the" is the one closest to s1: then s1,2 has the maximum dot product, and after the softmax it has the maximum value. The orange bars in the figure all sum to 1 because of the softmax. So when I'm producing the word y1 in the decoder, I take this softmax output, the alphas, the orange bars, and multiply each alpha by the corresponding hidden representation h. Rather than simply taking h7, as before, I take each of the h's, h1, h2, h3, and weight each one by my estimate of how much it matters: in this case I weight h2 a lot when producing "the". And when I'm producing the word "chat", I weight h3 and h4 more than the other hidden representations. So this is more powerful: when I'm producing "chat", which is cat, I'm looking at the representations h3 and h4, which are for "black cat", as opposed to the single representation h7 of the entire sentence. And we do this over and over for all the words. That's what attention is: a mechanism that lets the neural network focus on some parts of the input when producing each part of the output. Now, types of attention: these are the different ways of computing the similarity function s. One approach is the dot product, in which case the encoder and decoder hidden vectors have to be of the same dimension. They may not be, because you can always make the decoder hidden size different from the encoder's, in which case you can't use the dot product. But then you can use a multiplicative approach, with a rectangular matrix in between, and you still get a scalar as output. Or you can use what is called an additive approach, which is simply a small neural network that takes the concatenation of the encoder and decoder hidden representations, which can be of different sizes, pushes it through a layer, and gives a scalar on the output. It's not critical which one you use, really. The most important thing is that these are parameterized functions (except the dot product), so when you put them into the neural network and learn end-to-end, all the parameters, including the weights of these attention layers, are learned as well. A small sketch of one attention step follows.
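Here is a minimal NumPy sketch of one attention step, with the three score functions just mentioned. The sizes, the random vectors, and the small two-layer form of the additive score are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(0)

H_enc, H_dec, A = 6, 5, 4             # made up; encoder/decoder dims can differ
hs = rng.normal(size=(7, H_enc))      # encoder states h1..h7
s1 = rng.normal(size=H_dec)           # one decoder state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Three ways to score "how close is s1 to each h_i":
def score_dot(h, s):                  # dot product; needs H_enc == H_dec,
    return h @ s                      # which is not true here

W_m = rng.normal(size=(H_dec, H_enc))
def score_mult(h, s):                 # multiplicative: s^T W h, dims may differ
    return s @ W_m @ h

W_a = rng.normal(size=(A, H_dec + H_enc))
v_a = rng.normal(size=A)
def score_add(h, s):                  # additive: small net on [s; h] -> scalar
    return v_a @ np.tanh(W_a @ np.concatenate([s, h]))

scores = np.array([score_mult(h, s1) for h in hs])  # one scalar per h_i
alphas = softmax(scores)              # the orange bars: weights summing to 1
context = alphas @ hs                 # weighted sum of h1..h7, used for y1
print(alphas, context)
```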
Now, this figure I took from the first paper on using attention for machine translation, and you can see that it works very nicely. The sentence at the top is being translated from English to French; the English sentence is "the agreement on the European Economic Area was signed in August 1992". (I don't know French. I could say that one sentence from before, but I can't say this one, so I'm not going to try.) The interesting thing to note is that "économique européenne" is in the opposite order of "European economic". That happens a lot in machine translation: sometimes word order in one language is different in the other; sometimes a word from the first part of the sentence appears in the last part of the sentence in the target language. All these things can happen, and attention very nicely shows you what's going on. When this model produces the word "économique", it actually focuses its attention on "economic", which comes after "European"; and when it produces "européenne", it focuses on "European". If the alignment were linear, say "the cat chased the dog" with the words appearing in the same order in the other language, the attention would simply be a straight diagonal line. But you can see that attention lets the model focus on the right part of the input to produce each specific output. That's the cool thing about attention models. So what are some applications of attention? (How are we doing on time? Three more minutes, okay, I'll try to do this fast.) We have machine translation, text summarization, related search, query suggestion: different applications of encoder-decoder models. I wanted to close with an application we are actually using this for at LinkedIn. At LinkedIn, if you type in a query like "machine learning engineer", then along with your search results you get a bunch of other search suggestions: people also search for data scientist, data engineer, software engineer, and so forth. We are using exactly the model I just described, the encoder-decoder model with attention, to produce these. To build our training data, we take reformulated queries from search sessions. In other words, somebody types a query, they don't get the result they like, they type another query, they get the result they like, and they click it. When that happens, we collect the pair: the original query and the reformulated query whose result was clicked. We keep only the pairs where at least one word is in common, because sometimes you might type something and then type something completely different. So we have to play with the data; you might have heard in many talks that getting the data right is like 90% of the task. We also looked at the types of reformulations people make. Sometimes they type the query "A B C" and then drop the word B; that's called a relaxation. Sometimes they expand: they type "A C" and the second query is "A B C". And sometimes they do something more complicated. We found that relaxation is the easiest thing for the model to produce, because it's easy for the model to throw out a word, and if we included those pairs in our training corpus, our model would often just produce relaxed outputs. So we removed them from our training corpus. A small sketch of this pair filtering follows.
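Here is a tiny sketch of that pair filtering in plain Python, a hypothetical re-creation of the two heuristics just described (share at least one word; drop pure relaxations), not LinkedIn's actual code.

```python
def keep_pair(query, reformulation):
    """Keep a (query, reformulated query) training pair only if the two
    share at least one word and the reformulation is not a pure relaxation
    (a strict subset of the original words)."""
    w1 = set(query.lower().split())
    w2 = set(reformulation.lower().split())
    if not (w1 & w2):        # nothing in common: probably a topic change
        return False
    if w2 < w1:              # strict subset: a relaxation, which the model
        return False         # would otherwise learn to over-produce
    return True

pairs = [
    ("machine learning engineer", "machine learning scientist"),  # kept
    ("machine learning engineer", "machine learning"),            # dropped
    ("machine learning engineer", "accountant"),                  # dropped
]
for q, r in pairs:
    print(q, "->", r, ":", keep_pair(q, r))
```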
We tried this for English and German: 10 million pairs in English, 1 million pairs in German. Our model was a two-layer LSTM version of what I just showed you, with a hidden size of 128 and a multiplicative attention mechanism, and our results were as follows. The baseline was simply a lookup: we look up frequent reformulation pairs for a given query from past history. In other words, that baseline cannot generalize to new queries. With the encoder-decoder model, we saw an 80% increase in queries that received suggested queries, a 26% improvement in click-through rate on the search results, and a 1% increase in total job application rate: people look for jobs, and if they find the right job they may apply for it, which is a good thing for us at LinkedIn, since job application rate is one of our metrics of success. Similarly, for German we got a 3.2% increase in successful search results. So that's a real-world example of how encoder-decoder models with attention have actually helped at LinkedIn. These models have also been used for speech recognition at Google and other places; the embedded speech recognition system that runs on the phone actually uses an encoder-decoder model like the one I just described. So, in summary: it's a very powerful approach, with many applications. Going forward, there are many newer techniques, like BERT and transformers, that have come into play, and I think you should become more familiar with those. But it's still a good idea to know about encoder-decoder models with attention, because the components of this model are used in all those other models as well. Thanks. I think I went a little over time, but thanks so much for your attention.