The Transformer decoder in 60 seconds. Let's say we're building a translator from English to another language, say Kannada. It looks like this. For each Kannada word produced so far, we add a position encoding. From each word's vector we project query, key, and value vectors and perform masked self-attention, so each position can only look at earlier positions. We now have vectors for the Kannada words that better encode their context. From these we create new query vectors for each Kannada word, and from the encoder, which holds vectors for every English word, we construct key and value vectors. Then we perform cross-attention. At the end, we have vectors for every Kannada word that also carry the context of the English sentence. We repeat this stack of self-attention and cross-attention multiple times to capture the complexity of language, and then we can actually predict the word that comes next.
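A minimal sketch of one such decoder layer, assuming standard sizes (d_model=512, 8 heads) and PyTorch's built-in attention module rather than anything specific to this talk: masked self-attention over the Kannada tokens generated so far, followed by cross-attention where queries come from the Kannada side and keys/values from the encoder's English vectors.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder block: masked self-attention, cross-attention, feed-forward.
    Sizes are illustrative assumptions, not taken from the talk."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, kannada_vecs, english_vecs):
        t = kannada_vecs.size(1)
        # Causal mask: each Kannada position may only attend to itself and earlier positions.
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        x = kannada_vecs
        # Masked self-attention: queries, keys, and values all come from the Kannada side.
        sa, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + sa)
        # Cross-attention: queries from the Kannada vectors, keys/values from the encoder's English vectors.
        ca, _ = self.cross_attn(x, english_vecs, english_vecs)
        x = self.norm2(x + ca)
        return self.norm3(x + self.ff(x))

# Toy usage: a batch of 1, with 5 Kannada tokens generated so far and 7 English source tokens.
layer = DecoderLayer()
out = layer(torch.randn(1, 5, 512), torch.randn(1, 7, 512))
print(out.shape)  # torch.Size([1, 5, 512])
```

In a full decoder, several of these layers are stacked, and the final Kannada vectors are passed through a linear layer plus softmax over the vocabulary to predict the next word.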