Attention is one of the key breakthroughs that has enabled large-scale natural language processing. It addresses one of the core problems, how to handle long sequences, and it enables modern techniques like transformers.

So, what's the problem with seq2seq models? They sound great. You take in a sequence of words, "I went to the blah, blah, blah," make a hidden output, take that, feed it into a decoder, and it translates it and says "ich ging nach...," whatever. This works great for five words, ten words. It's terrible for a whole Wikipedia article. You cannot take a thousand words of text and compress them into a single 300-dimensional vector, and the longer the text is, the worse it works. Plus, it's slow and has lots of recurrence in the middle. So we need some technique that does not require the computer to remember everything it's seen and store it in a single vector. And to be clear, humans can't remember more than seven plus or minus two words either. You can't remember an entire Wikipedia article, recite it verbatim, and then translate it.

So, what will we do? Think about the way humans work when we translate. Ideally you'd translate one-to-one: "I" becomes "ich," "went" becomes "ging," "to" becomes "nach," "and" becomes "und," and so on and so forth. Now, that roughly works for most languages, but not quite, because there's no perfect one-to-one alignment between English and any other language. But it roughly works. Think about translating "L'accord sur la zone économique européenne a été signé en août 1992" (I can't pronounce it). The two words "l'accord" become "the agreement," "sur" becomes "on," "la" becomes "the." But now we have three things that don't quite match up: "zone économique européenne" becomes "European Economic Area." It's swapped. We're going right down one-to-one, and then we've got three words that switched. Or sometimes two words map to one: "a été" becomes "was." Then other words like "signé" become "signed," each one roughly one-to-one. So, roughly, things line up. And as we're translating, what are we doing? We're walking over the source saying, at each point: we've translated one word here, what's the next word we should find? "European"? Where does it come from? Oh, it's going to be this one, three down.

Okay, so what does this alignment problem look like in a neural net? We have a sequence of words in the source language X, or rather their embeddings. Then we're going to use, for example, a bidirectional LSTM and have a sequence of hidden states going forward and a sequence of hidden states going backward, and concatenate them, so we have a hidden state for each source word. We then want to say: we are this far along in the translation, we've translated each of these words up to "signed," what's the next word? Take the hidden state of the decoder here and match it against all of the embeddings there to find weights, to find which source word matches best. Take that word with a lot of weight, the others with a little weight, feed them into the decoder, and out comes the next word. We're going to run through this in detail later, so don't worry if that was a little fast.

So again, what's the idea? Encoder: take every word and embed it, whether with a simple embedding lookup, a feed-forward neural net, or a bidirectional net, but get some embedding for each. We will then walk through, word by word, on the decoder. There's a cool animation here if you want to see it.
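To make that matching step concrete, here's a minimal NumPy sketch of a single attention step: one decoder state scored against every encoder state, softmaxed into weights, and summed into a context vector. The shapes and variable names are illustrative assumptions, not anything from the lecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
# Assumed setup: 6 source words, each with a 300-dim encoder hidden state
# (e.g., forward and backward LSTM states concatenated), plus the decoder
# hidden state right before the next target word.
encoder_states = rng.standard_normal((6, 300))
decoder_state = rng.standard_normal(300)

scores = encoder_states @ decoder_state   # one similarity score per source word
weights = softmax(scores)                 # attention weights, summing to one
context = weights @ encoder_states        # weighted sum fed into the decoder
```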
Decoder: the decoder takes this and then learns, given its current hidden state, how much attention, how much weight, it should pay to each of these words, and then uses that to generate the next decoded word, and walks through it. You can do the exact same thing not just translating Chinese to English but translating sounds. This is a spectrogram, the sounds going into your microphone, mapped to the words or letters coming out, and you can see that at this point the network is paying attention to the sound here as being relevant for these letters over here. So you can translate speech to text, text to speech, anything to anything. Cool.

Another very closely related use of attention, one that's maybe a little easier to understand as we start going through the details, is question answering. In question answering we have a query, a question ("who died?"), along with a document. Here the document has been run through named entity recognition: entity 423, entity 261, blah, blah, blah, lots of entities in it. And you could then say: we want to take a question, find some deceased sailor, where does he show up in this document? Find the relevant piece, then take the embedding of the question and the embedding of the local piece here, or pieces, or locations, because there may be a couple of places this person was mentioned, and use those as input to compute an answer about who the person is or what they did, right? And the idea is that we will have attention weights given to each local word, or byte pair, or whatever the local unit of representation is. Cool.

So how does that work? The formal piece is going to involve queries, keys, and values. Always a little bit confusing, but not too bad. The query: think of the embedding of the question. The keys: think of the embeddings of all the words in the Wikipedia article, each embedded quite locally. Then we will compute, by one of a couple of different methods, a weighting, an attention score, which says, for each one of these keys, each one of these local embeddings of words (maybe a context-sensitive embedding), how much weight should it be given? Given all of these attention scores, we're going to softmax them, so we get a probability distribution, a set of weights across the keys that all sum to one. We'll multiply each key, each embedding of the words in the text, by that weight and get what's called a value; it's just the attention weight for each one times the key. And then we will take the weighted combination, sum up all these values, put them, along with the query, into a neural net, and predict the output: yes or no, or found it, or whatever the question is. Cool.

So there are two main attention models. One comes in where the query (the embedding of the question you're asking) and the keys (each local word in the text) may have different dimensions. In that case we multiply each of them by its own weight matrix, apply the squashing function tanh, and take a dot product with a learned vector; that gives us an attention score for the query and the key. We'll compute that for each query and for each key and then softmax. Or instead, what we could do is use dot-product attention. If the query and the key are the same length, each of size 300, we just take the dot product between them, usually divide that by the square root of 300, their length, to keep it normalized, and then we'll also softmax that.
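Here's a small sketch of those two scoring rules, additive (tanh) attention and scaled dot-product attention, with made-up dimensions; in a real model the matrices W_q, W_k and the vector v would be learned rather than random.

```python
import numpy as np

rng = np.random.default_rng(1)
d_q, d_k, d_a = 200, 300, 128                 # assumed query/key/attention sizes
W_q = 0.05 * rng.standard_normal((d_a, d_q))  # projects the query
W_k = 0.05 * rng.standard_normal((d_a, d_k))  # projects the key
v = 0.05 * rng.standard_normal(d_a)           # scoring vector

def additive_score(q, k):
    # Additive attention: project both, squash with tanh, dot with a
    # learned vector. Works even when q and k have different dimensions.
    return v @ np.tanh(W_q @ q + W_k @ k)

def scaled_dot_score(q, k):
    # Dot-product attention: requires matching dimensions; divide by
    # sqrt(d) to keep the score normalized.
    return (q @ k) / np.sqrt(len(k))

print(additive_score(rng.standard_normal(d_q), rng.standard_normal(d_k)))
q300, k300 = rng.standard_normal(300), rng.standard_normal(300)
print(scaled_dot_score(q300, k300))
```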
So, two different ways, but both of them are measuring something about how similar this query is to this key: how much does this query tell me about that key? Cool, so for machine translation, what does it look like? The query is precisely the hidden layer right before the next word you're trying to translate in the target language. And the keys are the embeddings of all of the words in the source document, the source language, right? These are done all in parallel, nice, right, fast. So we embed all the source words as the keys. We then embed the hidden layer feeding into the word we're trying to ask about. And then we will compute some attention, for example the softmax over all of these queries times keys: how similar is the query to the key, right? This is just a whole bunch of dot products done in parallel, normalized by the square root of the dimension of the embeddings, 300. Then we multiply every key by its corresponding attention weight; we get the values. We then sum the weighted values together, feed them in along with the decoder state, and get a function which gives us a softmax over the next word to translate, and we move forward.

So let me walk through that once again in words. If we're doing machine translation, for each step of the decoder, as we decode until we get a stop: we compute the encoder hidden states for every source word. We compute the alignment scores between the previous decoder hidden state (we're doing a translation in the decoder, we're about to find the next word, so we take the hidden state of the decoder right before the word I'm trying to produce) and every word in the encoding of the source language we're translating from. We multiply them and softmax. Then we take that resulting input vector, plus the previous decoder output hidden state, put that into an RNN, and produce the next hidden state and output. This is the same thing I said before, just in words. That was the hard part.

One last conceptual piece, which we'll want to use as we do these in the real world, is that often, when you're doing different or more complicated tasks, you want to pay attention to lots of different places in the source language. So instead of having a single attention, like a single CNN kernel that you walk over things, we'll have multiple attention models, called multiple heads. So we'll have two, or typically a dozen, different attention heads, each of which takes the same queries (the output of the decoder), the same keys (the hidden states of all the words in my input), and the values, which are often just the attention-weighted versions of those, plus lots of other fancy variants we'll see. Put those in, and we do that once, and twice, and twelve times. We now have twelve different attention models, all identical copies but with different weights. They pay attention to different aspects of the document. What entities were mentioned before? What co-references are there? What was the mood, positive or negative? Is it singular or plural? So one head can pay attention to the verbs in English to see if it's singular or plural. You can pay attention to lots of things at once, then concatenate them, feed them all back into a final neural net, and that then makes the prediction. And we're going to see lots of concrete examples of these.
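And here's a rough sketch of the multi-head idea with twelve heads. The per-head size, the random projections, and the final mixing matrix are my assumptions for illustration; real implementations learn all of these.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(2)
d, n_heads, d_head = 300, 12, 25   # assumed: 12 heads, each 25-dimensional
# Identical structure per head, different weights, so each head can
# learn to attend to a different aspect of the input.
Wq = 0.05 * rng.standard_normal((n_heads, d_head, d))
Wk = 0.05 * rng.standard_normal((n_heads, d_head, d))
Wv = 0.05 * rng.standard_normal((n_heads, d_head, d))
Wo = 0.05 * rng.standard_normal((d, n_heads * d_head))  # final mixing layer

def multi_head(query, keys):
    # query: (d,) decoder state; keys: (n_words, d) encoder states.
    outputs = []
    for h in range(n_heads):
        q = Wq[h] @ query                      # this head's query
        K = keys @ Wk[h].T                     # this head's keys
        V = keys @ Wv[h].T                     # this head's values
        w = softmax(K @ q / np.sqrt(d_head))   # this head's attention weights
        outputs.append(w @ V)                  # this head's output
    # Concatenate the 12 head outputs and feed them to a final layer.
    return Wo @ np.concatenate(outputs)

out = multi_head(rng.standard_normal(d), rng.standard_normal((8, d)))
print(out.shape)  # (300,)
```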
Okay, so what have we seen? If you want to embed a long document, do not run an encoder-decoder all the way through the document and save the hidden state. Instead, embed every word locally, using your favorite local embedding method (this can all be done in parallel); embed the query, either a question if you're answering one or the current state of the decoder; and then match them up, pay attention, and use a prediction to make either the answer to your question or the next word in your translation. This turns out to be beautifully parallelizable, which is what you want, because you want to run these things on fast GPUs, and it allows you to handle billions of words in training large, semi-supervised models, which is really the key to the current, amazingly good NLP performance.
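To see concretely why this is so parallelizable, here's a sketch, with made-up sizes, of scoring many queries against many keys in one matrix multiply, with no recurrence anywhere:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
Q = rng.standard_normal((50, 300))     # e.g., 50 decoder positions
K = rng.standard_normal((1000, 300))   # e.g., 1000 source words

# All 50,000 scores at once: one matrix multiply, no sequential loop,
# which is exactly the shape of computation GPUs are fastest at.
weights = softmax(Q @ K.T / np.sqrt(300), axis=-1)  # (50, 1000)
contexts = weights @ K                              # (50, 300)
```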