So now we put the whole thing together. BERT comes in a couple of different flavors. BERT base, the standard one that lots of people use (or the RoBERTa equivalent of it), has a component that takes in a source sentence and embeds all the words in it, takes a target sentence and embeds all the words in it, and adds in the positional encoding. It then uses a model with 12 attention heads, which pays attention to those embeddings in 12 different ways, and it has 12 layers. So it's a deep neural net: it goes through, takes each of these attentions, adds them, normalizes them, combines them, and gets an overall encoding. It does the same thing on the output, the target, except that the target has now been masked: 15% of the words in the target were hidden. It takes all of that, applies self-attention, and produces an encoding. It then takes the encoding from the encoder and the encoding from the decoder, feeds them into another block of multiple attention heads, passes the result through another feed-forward network, and in the end makes a prediction for the masked words. Then you train the whole thing by stochastic gradient descent.

The nice basic one has 110 million parameters, but if you have a bigger computer, you can use one with 340 million parameters. Ready to download. Cool. You will get to use these prefab in the assignment, and in the homework you can try to assemble these things yourself. These are all components you've known and used. You just glue them all together and run stochastic gradient descent.
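To make the gluing concrete, here is a minimal sketch in PyTorch of the BERT-base recipe as described: embed the tokens, add positional encodings, run 12 layers of 12-head self-attention with add-and-normalize and a feed-forward block, and predict the 15% of tokens that were masked, training with stochastic gradient descent. This is illustrative, not the real BERT implementation: the class name `TinyBert`, the toy `mask_tokens` routine, and the random batch are all made up for the example; only the hyperparameters (12 layers, 12 heads, hidden size 768, roughly 110 million parameters) follow BERT base.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 30522   # BERT's WordPiece vocabulary size
MAX_LEN = 512        # maximum sequence length
D_MODEL = 768        # hidden size
N_HEADS = 12         # attention heads per layer
N_LAYERS = 12        # stacked encoder layers
MASK_ID = 103        # id of the [MASK] token in BERT's vocabulary

class TinyBert(nn.Module):
    """A BERT-base-shaped masked language model, built from stock parts."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)  # embed the words
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)     # add in position info
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS,
            dim_feedforward=4 * D_MODEL,
            activation="gelu", batch_first=True)
        # 12 blocks, each: multi-head attention -> add & norm -> feed-forward
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.mlm_head = nn.Linear(D_MODEL, VOCAB_SIZE)    # predict the masked words

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)   # words + positions
        x = self.encoder(x)                               # 12 layers of self-attention
        return self.mlm_head(x)                           # logits over the vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    """Hide 15% of the tokens; the model is trained to recover them."""
    labels = token_ids.clone()
    masked = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    labels[~masked] = -100                # only masked positions count in the loss
    inputs = token_ids.clone()
    inputs[masked] = MASK_ID
    return inputs, labels

model = TinyBert()
opt = torch.optim.SGD(model.parameters(), lr=1e-4)        # stochastic gradient descent
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

batch = torch.randint(0, VOCAB_SIZE, (8, 128))            # stand-in for real token ids
inputs, labels = mask_tokens(batch)
logits = model(inputs)
loss = loss_fn(logits.view(-1, VOCAB_SIZE), labels.view(-1))
loss.backward()
opt.step()
```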
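Note that the loss is computed only at the masked positions (that is what the `ignore_index` setting does): the model gets no credit for copying the 85% of words it can already see, which is what makes this masked-language-model training rather than ordinary left-to-right prediction.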