Why Transformers Over Recurrent Neural Networks?

RNNs are slow to train and slow at inference because words are processed one at a time: the longer the sentence, the longer it takes. RNNs also struggle to capture the full context of a word. A standard RNN learns only from the words that come before a given word, yet a word's meaning depends on the words both before and after it. Even bidirectional RNNs fall short here, because they learn left-to-right and right-to-left context separately and then concatenate the two, so neither direction ever sees the whole sentence at once. Transformers address both problems. The encoder takes all input words in parallel and generates their word vectors in parallel, which speeds up training and inference. And the attention mechanism lets every word attend to every other word in the sentence, so the context it learns is bidirectional in a truer sense, and the word vectors produced by the encoder capture the meaning of each word far better.
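
To make the parallelism and bidirectional context concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core of the Transformer encoder. It is an illustrative simplification, not the full architecture: a real layer would apply learned query/key/value projections and multiple heads, and the function name and toy dimensions below are assumptions for the example. The point it shows is that every word's output vector is computed from all positions in the sentence, before and after it, in one matrix operation rather than one time step at a time.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a whole sentence at once.

    X: (seq_len, d_model) matrix of word vectors.
    Returns a (seq_len, d_model) matrix in which each row mixes information
    from every position, weighted by attention scores.
    (Sketch only: real Transformers use learned W_Q, W_K, W_V projections
    and multiple attention heads.)
    """
    d_model = X.shape[-1]
    # Queries, keys, and values are just X here; a real layer projects X first.
    scores = X @ X.T / np.sqrt(d_model)              # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ X                               # each output row sees the full sentence

# Toy example: 4 "words", each an 8-dimensional vector, processed in parallel.
sentence = np.random.randn(4, 8)
contextual = self_attention(sentence)
print(contextual.shape)  # (4, 8): one context-aware vector per word
```

Because the attention weights are computed for all pairs of positions at once, no recurrence is needed, which is what allows the encoder to process the sentence in parallel and to draw context from both directions.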