How can we use parallelization for sequential data? In 2017, the transformer neural network architecture was introduced. Like sequence-to-sequence recurrent neural networks, it uses an encoder-decoder architecture; the key difference is that the input sequence can be processed in parallel. Consider translating a sentence from English to French. With an RNN encoder, we feed in the English sentence one word at a time: the hidden state for the current word depends on the hidden state for the previous word, so the word representations are generated one time step at a time. With a transformer encoder, on the other hand, there is no concept of a time step for the input. We pass in all the words of the sentence simultaneously and compute their embeddings simultaneously.
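To make the contrast concrete, here is a minimal PyTorch sketch; all sizes and module choices are illustrative assumptions, not any particular paper's setup. The RNN encoder must loop over time steps because each hidden state feeds the next, while the transformer encoder layer processes every position in a single call:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, d_model, seq_len = 1000, 64, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))  # one sentence of 8 token ids
embed = nn.Embedding(vocab_size, d_model)
x = embed(tokens)  # shape: (1, seq_len, d_model)

# RNN encoder: each hidden state depends on the previous one,
# so the loop over time steps cannot be parallelized.
rnn_cell = nn.GRUCell(d_model, d_model)
h = torch.zeros(1, d_model)
rnn_states = []
for t in range(seq_len):              # sequential: step t needs step t-1
    h = rnn_cell(x[:, t, :], h)
    rnn_states.append(h)
rnn_out = torch.stack(rnn_states, dim=1)   # (1, seq_len, d_model)

# Transformer encoder layer: self-attention sees the whole sentence
# at once, so all positions are processed in one parallel call.
# (Positional encodings are omitted here for brevity.)
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
transformer_out = enc_layer(x)             # (1, seq_len, d_model), no time loop

print(rnn_out.shape, transformer_out.shape)
```

Because the transformer has no built-in notion of word order, real implementations add positional information to the embeddings; the parallelism shown here is what that trade-off buys.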