Okay, we're here at last: transformers and BERT. We are up to the state of the art in deep learning, at least if you don't have a really big computer and lots of training data like OpenAI, and these are the models that one actually wants to use. So what are people doing now, since about 2019? Transformers. The idea is a general extension of seq-to-seq models, but ones that use attention and a couple of other tricks to parallelize.

And the cool thing about transformers is that you can take in all sorts of different inputs and get all sorts of different outputs. So here's a cool unified text-to-text transformer. It takes in an instruction that is just a string, "translate English to German: That is good," and it will put out "Das ist gut." Or you type in something that says "summarize," give it a sequence of text, and it gives you a summary. Or you give it something asking whether a sentence is acceptable, "the course is jumping well," and it says "not acceptable." So what do we want? We want the universal answer-all-questions piece, which takes in anything with a prefix and a sentence and gives you the answer. Awesome, cool, and neat. (There's a small code sketch of this text-to-text setup below.)

We're going to start by building something that looks like a language model, but not a recurrent language model: an attention-based language model. So BERT, the core transformer model you're most likely to use now, uses multi-headed self-attention. We've covered attention already, but we'll add self-attention. It uses positional encoding, because the attention-based methods we looked at don't on their own know where a word is, how far away or how close it is; the positional encoding puts that information back in. It uses masking, which we'll cover, which is basically something we saw before in autoencoders: hide some word and predict the missing word. And it uses WordPiece tokenization, another variation on byte-pair encoding, where the word "tokenization" would be represented as "token" plus "##ization," so it breaks words up into subword pieces (there's a small tokenizer sketch below as well).

In your worksheet, you're mostly going to use RoBERTa, which has the exact same architecture as BERT, uses a byte-pair encoding very much like the WordPiece tokenization, and is trained on more data. More, more, more data; more data, better models. Cool. So BERT was underfit: it hadn't been trained on enough data, even though it had a lot of data. Cool.

So self-attention. What is self-attention? Instead of translating one sentence in English to a sentence, or the next word, in Chinese, we're going to take the same sentence: "The animal didn't cross the street because it was too tired." And we will correlate the current word's position vector, its embedding, with all of the word position vectors in the same sequence, so the sentence pays attention to itself. It will then learn weights in a way that does the best job possible at masking, at predicting an individual missing word. And it turns out that the attention that "it" pays goes a lot towards "animal," a little bit to "the," not quite as much to the word "street," a little bit to itself, and a little bit to the period, who knows. So note what happens: instead of translating from a sentence in one language to a different language, we go from the same sentence to itself, asking, if something is masked or hidden, which pieces should I attend to?
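To make that text-to-text setup concrete, here's a minimal sketch, assuming the Hugging Face transformers library and the publicly released t5-small checkpoint (not necessarily the exact model from the slides); the task is given as a plain-text prefix on the input string, and the acceptability prefix shown is the CoLA-style one from the T5 paper.

```python
# A minimal text-to-text sketch: the task is just a prefix on the input string.
# Assumes the Hugging Face `transformers` library and the `t5-small` checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: That is good.",   # -> something like "Das ist gut."
    "summarize: Transformers extend seq-to-seq models with attention and a few "
    "tricks to parallelize, and the same model can translate, summarize, or judge sentences.",
    "cola sentence: The course is jumping well.",    # acceptability judgment, e.g. "not acceptable"
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same weights handle all three requests, because every task is framed as text in, text out.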
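And since the subword pieces matter for the worksheet, here's a quick look at the two tokenizers mentioned above, again assuming the Hugging Face transformers library with the bert-base-uncased and roberta-base checkpoints; the exact splits depend on each model's learned vocabulary, so treat the ones in the comments as examples.

```python
# WordPiece (BERT) vs. byte-pair (RoBERTa) subword tokenization.
from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")

print(bert_tok.tokenize("tokenization"))     # e.g. ['token', '##ization']; '##' marks a word-internal piece
print(roberta_tok.tokenize("tokenization"))  # byte-pair pieces, e.g. ['token', 'ization']
```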
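To show mechanically what "the sentence pays attention to itself" means, here's a minimal single-head self-attention sketch in PyTorch. The toy sizes and random projection matrices are just for illustration; in BERT those projections are learned, and there are many heads and many layers.

```python
# Minimal single-head self-attention over one sentence (toy sizes, random weights).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 11, 16          # 11 tokens, e.g. "The animal didn't cross the street because it was too tired"
x = torch.randn(seq_len, d_model)  # one embedding per token of the *same* sentence

# Projections that turn each embedding into a query, a key, and a value (learned in a real model).
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every position's query is scored against every position's key in the same sentence;
# that is what makes it *self*-attention.
scores = Q @ K.T / d_model**0.5      # (seq_len, seq_len)
weights = F.softmax(scores, dim=-1)  # each row sums to 1: how much "it" attends to "animal", "street", ...
output = weights @ V                 # each position becomes a weighted mix of all positions' values

print(weights.shape, output.shape)   # torch.Size([11, 11]) torch.Size([11, 16])
```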
This should look sort of like the skip-gram encoding we saw earlier for word2vec and fastText, in the sense that it asks some version of: how correlated is this word with every other word in its context? But we're going to do this with multiple heads. So we could take that same word in the same sentence with the same architecture but a different head, and the different head might pay less attention to "animal," a little more attention to "street," and some attention to other things. So each head will attend somewhat differently. How does it decide what to attend to? Well, of course, it's going to use stochastic gradient descent to optimize a loss function, which in this case is to predict words which are masked, but we'll get there in a little bit (there's a small fill-in-the-blank example below).

Okay, so the transformer attention for seq-to-seq actually has three different little attention pieces. We have an encoder, which does encoder self-attention just like we saw: it takes the input sentence of the seq-to-seq and encodes each word with a bunch of different attention heads, say a dozen, such that each predicts other words just like we saw. And then there's decoder attention. We have a decoder model separate from the encoder model. The decoder model will also pay attention to other words, but in the decoder sense, the output sense, right? The input sentence pays attention to itself; the output sentence pays attention to itself. And then we'll, deep breath, have a third, encoder-decoder attention, that takes in information from both of these together and uses them. We'll put them all together a little bit later.
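Here's that fill-in-the-blank objective in action, as a small sketch using the fill-mask pipeline from the Hugging Face transformers library with the roberta-base checkpoint from the worksheet; the exact predictions and scores you get may differ.

```python
# Hide a word and ask the pretrained model to predict it (the masked-word objective).
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
for pred in fill("The animal didn't cross the street because it was too <mask>."):
    print(f'{pred["token_str"]:>12}  {pred["score"]:.3f}')
```

During pretraining, the loss is the cross-entropy of the true hidden word under this predicted distribution, and stochastic gradient descent nudges the attention weights toward whatever fills the blank best.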
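And here's a rough sketch of those three attention pieces side by side, using PyTorch's nn.MultiheadAttention with toy sizes; a real transformer layer also wraps each of these in residual connections, layer normalization, and feed-forward blocks, which we'll get to when we put it all together.

```python
# The three attention blocks in a seq-to-seq transformer (toy sizes only).
import torch
import torch.nn as nn

d_model, n_heads = 16, 4
src = torch.randn(7, 1, d_model)  # encoder input: (src_len, batch, d_model)
tgt = torch.randn(5, 1, d_model)  # decoder input so far: (tgt_len, batch, d_model)

enc_self_attn = nn.MultiheadAttention(d_model, n_heads)  # input sentence attends to itself
dec_self_attn = nn.MultiheadAttention(d_model, n_heads)  # output sentence attends to itself
cross_attn = nn.MultiheadAttention(d_model, n_heads)     # decoder queries, encoder keys and values

enc_out, _ = enc_self_attn(src, src, src)
dec_out, _ = dec_self_attn(tgt, tgt, tgt)
mixed, attn = cross_attn(dec_out, enc_out, enc_out)      # the third, encoder-decoder attention

print(mixed.shape, attn.shape)  # torch.Size([5, 1, 16]) torch.Size([1, 5, 7])
```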