Finally, how do we train this? Well, what we're going to do is take all of the sentences in our training set and randomly mask out 15% of the words. Why 15%? Because it works better than 10% or 20%. Okay? So we have our special start-of-sentence token, CLS, then "I went to the [MASK] and bought a [MASK] of milk," a period, and then the separator token. That's the input, which we then encode in our usual fashion and give attention weights. For the output, well, we don't care about anything except the masked positions: predict that the masked value is "store", and predict that the masked value is "quart". So what your loss function is trying to do is maximize the log of the probability you assign to the word that actually was the correct masked word. Note this requires no recurrence. Note that predicting this word "store" is going to use, via self-attention, an encoding of the whole piece to its left and an encoding of the whole piece to its right, all combined together. We then train it with one big gradient descent method, and it can all work fast and in parallel. And the fact that we have positional encoding lets the model know that the word before this missing one, the masked one, was in fact "the" and the word after it was "and", so it knows where things are. And it knows that "milk" is pretty far away from this first masked word, but much closer to that second one. And note that "milk" is relevant to the answer being "store", and "milk" is even more relevant to the answer being "quart". So the location matters.
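To make that concrete, here is a minimal sketch of one masked-language-model training step in PyTorch. This is not BERT's actual code: the layer sizes, the MASK_ID value, the helper name mlm_step, and the simplified masking (always replacing the chosen word with [MASK], with no 80/10/10 split) are all assumptions for illustration. It just shows the idea: mask 15% of the words, encode the whole sentence with self-attention plus positional encodings, and compute the loss only at the masked positions so we maximize the log probability of the correct word.

import torch
import torch.nn as nn

# Toy sizes for illustration; the real BERT uses a ~30k wordpiece vocabulary and wider layers.
VOCAB_SIZE = 30522
MASK_ID = 103          # assumed id of the [MASK] token
MASK_PROB = 0.15       # mask out 15% of the words

embed = nn.Embedding(VOCAB_SIZE, 256)
pos_embed = nn.Embedding(512, 256)   # positional encoding, so the model knows where things are
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
to_vocab = nn.Linear(256, VOCAB_SIZE)   # a word prediction at every position
loss_fn = nn.CrossEntropyLoss()         # minimizing this maximizes log p(correct word)

def mlm_step(token_ids):                             # token_ids: (batch, seq_len)
    mask = torch.rand(token_ids.shape) < MASK_PROB   # pick roughly 15% of positions
    inputs = token_ids.clone()
    inputs[mask] = MASK_ID                           # replace those words with [MASK]
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    h = encoder(embed(inputs) + pos_embed(positions))  # self-attention over the whole sentence, in parallel
    logits = to_vocab(h)
    # Loss only at the masked positions: predict the original word that was there.
    return loss_fn(logits[mask], token_ids[mask])

# One gradient-descent step on a fake batch; in practice this runs over the whole training set.
batch = torch.randint(0, VOCAB_SIZE, (8, 32))
loss = mlm_step(batch)
loss.backward()

Notice there is no recurrence anywhere in the sketch: every position is encoded at once, which is what lets the whole thing run fast and in parallel.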