Transformers vs. BERT and GPT. Transformers have a few disadvantages. They weren't necessarily designed to be language models; they were designed to be sequence-to-sequence models. Sentences just happen to be sequences of words, so we can technically solve language problems with transformers. But even so, language is complex. Transformers would need a lot of data to learn these representations from scratch, and they might need to become more complex to capture the essence of language. This is where BERT and GPT shine. BERT is a stack of transformer encoders, and GPT is a stack of transformer decoders. Both were designed to understand language during their pre-training phase and then to be fine-tuned on a specific task during their fine-tuning phase.
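To make this concrete, here's a minimal sketch using the Hugging Face transformers library (my assumption; no library is named above). It loads BERT as a stack of encoders and GPT-2 as a stack of decoders, then shows the pre-train/fine-tune split by attaching a small classification head to the pre-trained BERT weights.

```python
# Minimal sketch, assuming the Hugging Face transformers library and PyTorch.
from transformers import (
    AutoTokenizer,
    BertModel,
    GPT2Model,
    BertForSequenceClassification,
)

# BERT: a stack of transformer encoders, pre-trained on masked language modeling.
bert = BertModel.from_pretrained("bert-base-uncased")
print(len(bert.encoder.layer))  # 12 encoder blocks in the base model

# GPT-2: a stack of transformer decoders, pre-trained on next-token prediction.
gpt2 = GPT2Model.from_pretrained("gpt2")
print(len(gpt2.h))  # 12 decoder blocks in the base model

# Fine-tuning: reuse the pre-trained encoder and train a small task-specific
# head on top, e.g. for binary sentence classification.
classifier = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(
    "Transformers are sequence-to-sequence models.", return_tensors="pt"
)
logits = classifier(**inputs).logits  # shape (1, 2): one score per label
```

The point of the sketch is the division of labor: the heavy language understanding is learned once during pre-training, and fine-tuning only has to adapt those weights (plus a tiny head) to the task at hand.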