Let's talk about the first phase of the transformer neural network architecture in detail. We take all the words of the input sequence simultaneously and pass them to the encoder, but before doing so each word is encoded as a one-hot vector of vocabulary size, with the sequence padded to a fixed maximum length. Each word is then transformed into a 512-dimensional vector by multiplying it with an embedding matrix W_e, where W_e is a set of learnable weights. To each word vector we add a positional encoding vector of the same size, derived from sine and cosine functions of different frequencies. We then get new vectors that encapsulate the meaning as well as the position of each word. For each of these vectors we then create query, key, and value vectors by multiplying it with three separate weight matrices, whose weights are also learnable parameters. All of these query, key, and value vectors are later passed into an attention unit to compute self-attention in the encoder, which eventually generates word vectors that encapsulate meaning better than their inputs do.
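To make these steps concrete, here is a minimal NumPy sketch of the pipeline just described: embedding lookup, sinusoidal positional encoding, the query/key/value projections, and a single self-attention step. Only the 512-dimensional model size comes from the text above; the vocabulary size, sequence length, weight initialization, and the single-head simplification are illustrative assumptions, and the names (W_e, W_q, W_k, W_v) are chosen for readability.

```python
import numpy as np

d_model = 512          # embedding size from the text
vocab_size = 10_000    # assumed toy vocabulary size
seq_len = 8            # assumed toy (padded) sequence length
rng = np.random.default_rng(0)

# 1) Embedding: a one-hot vector times the learnable matrix W_e
#    is just a row lookup, so we index W_e directly.
W_e = rng.normal(0, 0.02, (vocab_size, d_model))
token_ids = rng.integers(0, vocab_size, seq_len)   # stand-in for real tokens
x = W_e[token_ids]                                 # (seq_len, d_model)

# 2) Sinusoidal positional encoding, same size as the embeddings:
#    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
#    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
pos = np.arange(seq_len)[:, None]
i = np.arange(0, d_model, 2)[None, :]
angles = pos / np.power(10_000.0, i / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)
x = x + pe                                         # meaning + position

# 3) Query, key, value vectors: three separate learnable matrices.
W_q, W_k, W_v = (rng.normal(0, 0.02, (d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# 4) Scaled dot-product self-attention (single head for simplicity):
#    softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                                  # context-aware word vectors
print(out.shape)                                   # (8, 512)
```

In the actual encoder this attention is computed across several heads in parallel and followed by further layers, but the single-head version above is enough to see the data flow from one-hot tokens to context-aware word vectors.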