ELMo vs BERT

ELMo is a bidirectional LSTM network, while BERT is a stack of Transformer encoders. ELMo is trained with a language-modeling objective (separate forward and backward language models), while BERT is trained on two tasks: masked language modeling and next sentence prediction.

ELMo is slow to train because its LSTM cells rely on backpropagation through time. Transformers are quicker to train because self-attention can be parallelized across the sequence.

ELMo may not capture true context: it learns forward and backward context independently and only concatenates the two at the end. BERT, on the other hand, is deeply bidirectional, since self-attention lets every token attend to both its left and right context simultaneously at every layer.
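The contrast between shallow and deep bidirectionality is easiest to see in code. Below is a minimal PyTorch sketch, not the actual ELMo or BERT implementations; the layer sizes and toy input are made up for illustration. The ELMo-style path runs two independent unidirectional LSTMs and fuses their outputs only by concatenation, while the BERT-style path uses a Transformer encoder, where attention mixes left and right context at every layer.

```python
# Toy sketch (assumed shapes, not the real models): shallow vs deep bidirectionality.
import torch
import torch.nn as nn

batch, seq_len, dim = 2, 8, 64
x = torch.randn(batch, seq_len, dim)  # stand-in for token embeddings

# ELMo-style: two independent unidirectional LSTMs. Each token's representation
# sees only one direction; the two contexts meet only at the final concatenation.
fwd_lstm = nn.LSTM(dim, dim, batch_first=True)
bwd_lstm = nn.LSTM(dim, dim, batch_first=True)
fwd_out, _ = fwd_lstm(x)                        # left-to-right context only
bwd_out, _ = bwd_lstm(torch.flip(x, dims=[1]))  # right-to-left context only
bwd_out = torch.flip(bwd_out, dims=[1])         # re-align to the original order
elmo_style = torch.cat([fwd_out, bwd_out], dim=-1)  # contexts fused only here

# BERT-style: every self-attention layer lets each token attend to all
# positions, so left and right context mix at every layer, not just at the end.
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
bert_style = encoder(x)

print(elmo_style.shape)  # (2, 8, 128): two directions concatenated
print(bert_style.shape)  # (2, 8, 64): jointly contextualized representations
```

Note also why the LSTM path is slower to train: each step of `fwd_lstm` depends on the previous step's hidden state, so the sequence is processed one token at a time, whereas the encoder processes all eight positions in parallel.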