 BERT vs GPT Both are derived from the transformer neural network architecture that can convert sequences to sequences. A sequence is typically an order set of data, and in natural language processing, this could be a set of words to form a sentence. BERT and GPT specifically focus on natural language. As far as their architecture is concerned, BERT is a stack of transformer encoders and the GPT is the stack of transformer decoders. Both are pre-trained to understand language, then fine-tuned to understand a specific task in that language. So BERT, for example, is pre-trained on natural language inference and sentence text similarity, whereas GPT is pre-trained on the language modeling objective of predicting the next word in a given sentence. They are then both fine-tuned with supervised data, although GPT can alternatively use meta-learning to make decisions without fine-tuning.