Okay, we're here at last: Transformers and BERT. We're up to the state of the art in deep learning, at least if you don't have a really big computer and lots of training data like OpenAI; these are the models one actually wants to use. So what have people been doing since about 2019? Transformers. The idea is a general extension of seq2seq models, but ones that use attention and a couple of other tricks to parallelize. And the cool thing about transformers is that you can take in all sorts of different inputs and get all sorts of different outputs. So here's a cool unified text-to-text transformer. It takes in an instruction, just a string, "translate English to German: That is good.", throws it in, and it puts out "Das ist gut." Or you type in something that says "summarize:", give it a sequence of text, and it gives you a summary. Or you give it a sentence to judge, "The course is jumping well.", and it says "not acceptable." So what do we want? We want the universal answer-all-questions machine, which takes in anything with a prefix and a sentence and gives you the answer. Awesome. Cool and neat.

We're going to start by building something that looks like a language model, but not a recurrent language model: an attention-based language model. So BERT, the core transformer that's most widely used now, uses multi-headed self-attention. We've covered attention already, but here we add self-attention. It uses positional encoding, because many of the attention-based methods we looked at don't know where a word is, how far away or how close it is, so BERT puts that information back in. It uses masking, which we'll cover, which is basically something we saw before in autoencoders: just hide some word and predict the missing word. And it uses WordPiece tokenization, another variation of byte-pair encoding, where the word "tokenization" would be represented as "token" plus "##ization". So it breaks words up into subword pieces. In your worksheet, you're mostly going to use RoBERTa, which has the exact same architecture as BERT. It uses a byte-pair encoding very much like WordPiece tokenization, and it's trained on more data. More, more, more data, better models. Cool. So BERT was underfit: it didn't have enough data, even though it had a lot of data. Cool.

So, self-attention. What is self-attention? With self-attention, instead of translating one sentence in English into one sentence, or the next word, in Chinese, we're going to take the same sentence, "The animal didn't cross the street because it was too tired," and correlate the current word's position vector, its embedding, with all of the word position vectors in the same sequence. The sentence is paying attention to itself, and it will then learn weights in a way that does the best job possible of masking, of predicting this word if it were missing. And it turns out that the attention it pays will go a lot toward "animal", a little to "the", not quite as much to the word "street", a little to "it", and a little to the period. Who knows. So note what happens: instead of translating from a sentence in one language to a different language, we go from the same sentence to itself, asking, if something is masked or hidden, which pieces should I use?
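To make the "sentence attends to itself" idea concrete, here is a minimal NumPy sketch of one self-attention head: each position is projected to a query, key, and value, the query of each word is dotted with the keys of all words in the same sentence, and the resulting softmax weights mix the value vectors. The shapes and random inputs here are made up for illustration; real transformers use many heads and learn the projection matrices by gradient descent.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over one sequence.
    X:  (n, d) matrix of token embeddings for the same sentence.
    Wq, Wk, Wv: learned projection matrices, (d, d) each (illustrative sizes).
    Returns (n, d): each row is a weighted mix of all positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n): how much each word attends to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # mix value vectors by attention weight

# toy example: 7 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
n, d = 7, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (7, 8)
```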
This should look sort of like the skip-gram encoding we saw earlier for word2vec and fastText, in the sense that it asks some version of "how correlated is this word with every other word in its context?" But we're going to do this with multiple heads. So we could take that same word in the same sentence, with the same architecture, but a different head, and the different head might pay less attention to "animal" and a little more attention to "street" and some attention to other things. So each head will attend somewhat differently. How does it decide what to attend to? Well, of course, it's going to use stochastic gradient descent to optimize a loss function, which in this case is to predict words that are masked, but we'll get there in a little bit.

Okay, so the transformer attention for seq2seq actually has three different attention pieces. We have an encoder, which does encoder self-attention just like we saw: it takes any sentence going into the seq2seq model and encodes each word with a bunch of different attention heads, say a dozen, such that it predicts other words just like we saw. Then there's decoder attention. We have a decoder model separate from the encoder model. The decoder model will also pay attention to other words in the decoder sentence, the output sentence, right? Input sentence pays attention to itself; output sentence pays attention to itself. And then, deep breath, we have a third, encoder-decoder attention, that takes in information from both of these together and uses it. We'll put them all together a little bit later.

Secondly, we need to have positional encoding. The problem is that attention models don't preserve the sequence information. They look at all the other words in the input and say which ones are important, and they forget which one was first or second or third. Now you could try to take each word and say, okay, give it a label: position one or two or three or four. But it turns out to work better to use the following slightly arcane encoding, which builds up a representation of where each word is in the sequence using a bunch of sines and cosines. If we have n tokens in a sequence, each of dimension D, say 300, like word2vec, we build an n-by-D matrix where each position, the 0th, 1st, 2nd, up to, in this case, the 60th word, is encoded by some mixture: the even-numbered elements of that 300-dimensional vector are sines of something and the odd-numbered ones are cosines. What you're doing is having a slow sine, a faster sine, a faster sine still, and so on across the dimensions. And this turns out to give a nice representation that lets the neural net keep track of where it is. In sort of weird neural-net fashion, what they do is take the matrix X, the original n-by-D matrix of actual encodings (like word2vec) of all n words, add in the positional encoding, adding them up together keeps roughly the same information, and then just pop that into your neural net. The details don't matter. The fact that does matter: you need to tell the model the order the words are in, so it can actually take advantage of that and compute, for example, how far away a word is, or whether it's at the beginning or the end of the sentence, where words have slightly different meanings based on where they show up.

Finally, how do we train this?
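Here's a small sketch of that sinusoidal positional encoding, following the standard transformer recipe: sines on the even dimensions, cosines on the odd ones, with geometrically spaced frequencies (the constant 10000 is the choice from the original transformer paper). The shapes below are illustrative.

```python
import numpy as np

def positional_encoding(n_tokens, d_model):
    """Return an (n_tokens, d_model) matrix of sinusoidal position codes."""
    positions = np.arange(n_tokens)[:, None]          # (n, 1) positions 0..n-1
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    freqs = 1.0 / (10000 ** (dims / d_model))         # slow to fast wavelengths
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(positions * freqs)           # even dims get sines
    pe[:, 1::2] = np.cos(positions * freqs)           # odd dims get cosines
    return pe

# add the position codes to the word embeddings before the first attention layer
n, D = 60, 300                       # e.g. a 60-word sequence of 300-dim embeddings
X = np.random.randn(n, D)            # stand-in for the word2vec-style embeddings
X_with_positions = X + positional_encoding(n, D)
```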
Well, what we're going to do is take all of the sentences in our training set and randomly mask out 15% of the words. Why 15%? Because it works better than 10% or 20%. Okay, so we have our special start-of-sentence token, [CLS], then "I went to the [MASK] and bought a [MASK] of milk .", and then the separator token [SEP]. That's the input, which we will then go and encode in our usual fashion and give attention weights. For the output, well, we don't care about anything except the masks: predict the masked value "store" and predict the masked value "quart". So what your loss function is trying to do is maximize the log of the probability you assign to the words that actually were behind the masks. Note this requires no recurrence. Note that predicting this word, "store", is going to use a self-attention encoding of this whole piece here, an encoding of what we've predicted on the rest of the piece here, and a bunch of other stuff together, and we'll then use one big gradient-descent method, but it can all run fast and in parallel. And the fact that we have positional encoding lets the model know that the word before the missing one, the masked one, was in fact "the", and the word after it was "and", so it knows where things are. It knows that "milk" is pretty far away from this mask, but much closer to that one. And note that "milk" is relevant to being at a store, and "milk" is even more relevant to being a quart. So the location matters. Cool.

So now we put the whole thing together. BERT comes in a couple of different flavors. BERT base, the standard one that lots of people use, or the RoBERTa equivalent of it, has in it something that takes a source sentence and embeds all the words in it, takes a target sentence and embeds all the words in it, and adds in the positional encoding. It then uses a 12-attention-head model that pays attention in 12 different ways to the embeddings, and then has 12 layers, so it's a deep neural net: it goes through, takes each of these attentions, adds them, normalizes them, combines them, and gets an overall encoding. It does the same thing on the output, the target. But the target has now been masked: 15% of the words in the target were hidden. It takes all of that, does self-attention, and encodes it. It then takes the encoding from the encoder and the encoding from the decoder, feeds them into another block of multiple attention heads, and passes that through another deep feed-forward block. In the end it makes a prediction for the masked words. And then you train the whole thing by stochastic gradient descent. The nice basic one has 110 million parameters, but if you have a bigger computer, you can use one that's got 340 million parameters, ready to download. Cool. You will get to use these prefab in the assignment; on the homework you can try to assemble these things yourself. These are all components you've known and used. You just glue them all together and run stochastic gradient descent.
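Once the pretrained weights are downloaded, playing with the masked-word objective takes only a few lines. This is a minimal sketch assuming the Hugging Face transformers library and its fill-mask pipeline with the roberta-base checkpoint; your worksheet may wrap this differently.

```python
# Minimal sketch: ask a pretrained masked language model to fill in a blank.
# Assumes the Hugging Face `transformers` library and the "roberta-base" checkpoint.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is <mask> (BERT uses [MASK]).
for guess in unmasker("I went to the <mask> and bought a quart of milk."):
    print(f"{guess['token_str']:>12}  p={guess['score']:.3f}")
```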