 What is cross-attention versus self-attention? Let's say we want to translate from English into a language called Kannada. During attention, each word is encoded as a query, a key, and a value vector. The query vector encodes "what am I looking for?", the key vector encodes "what can I offer?", and the value vector encodes what the word actually offers during attention. In self-attention, each English word is converted into a query, key, and value vector, and these are used to produce new vectors that better capture the context of the English sentence. In cross-attention, every English word is converted into a key and value vector as before, but every Kannada word is converted into a query vector. The resulting vectors encode the Kannada words while taking the English words into account.
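To make the distinction concrete, here is a minimal NumPy sketch. It is an illustration, not a full transformer layer: the embeddings and projection matrices are random stand-ins for learned ones, and the names (english, kannada, W_q, and so on) and sizes are assumptions chosen just for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention: each query scores every key,
    # and the scores weight a sum over the value vectors.
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ values

rng = np.random.default_rng(0)
d_model = 8

# Toy embeddings: 5 English (source) tokens and 4 Kannada (target) tokens.
english = rng.normal(size=(5, d_model))
kannada = rng.normal(size=(4, d_model))

# In a real model these projections are learned; random here for illustration.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Self-attention: queries, keys, and values all come from the English tokens,
# so each English vector is re-encoded in the context of the other English words.
self_out = attention(english @ W_q, english @ W_k, english @ W_v)    # shape (5, d_model)

# Cross-attention: queries come from the Kannada tokens, while keys and values
# come from the English tokens, so each Kannada vector is re-encoded
# taking the English sentence into account.
cross_out = attention(kannada @ W_q, english @ W_k, english @ W_v)   # shape (4, d_model)

print(self_out.shape, cross_out.shape)  # (5, 8) (4, 8)
```

Notice that the only difference between the two calls is where the queries come from: the source sentence in self-attention, the target sentence in cross-attention, while the keys and values stay with the source.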