GPT and transfer learning. Language is complex and difficult to model, and GPT models are expressive enough to capture much of that complexity: they are stacks of dozens of transformer decoder blocks. Because language is so complex, learning tasks like translation and question answering directly would need a ton of labeled data, which can be very hard to come by. Instead, it would be nice to start from a model whose parameters are already close to what we need; then we can tune the model with little data for the specific problem we want to solve. We are effectively transferring knowledge, hence the name transfer learning. To get the initial model, we train on language modeling (predicting the next token) so the model acquires a general understanding of language. This is pre-training. To get the final model, we take a modest set of examples for a task like translation and adjust the pre-trained parameters on them. This phase is fine-tuning. In the end, we get a good model without requiring too much fine-tuning data.
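To make the two phases concrete, here is a minimal sketch using PyTorch and the Hugging Face transformers library (an assumption; the original does not name a toolkit). The downloaded GPT-2 weights stand in for the pre-training phase, and the tiny `examples` list is a hypothetical placeholder for real task data.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Pre-training already done for us: load GPT-2 weights
# learned via large-scale language modeling.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical task data: a few question-answering examples
# formatted as plain text.
examples = [
    "Q: What is the capital of France? A: Paris",
    "Q: Who wrote Hamlet? A: William Shakespeare",
]

# Fine-tuning: a handful of gradient steps on the small dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in examples:
        inputs = tokenizer(text, return_tensors="pt")
        # Passing `labels` makes the model compute the next-token
        # cross-entropy loss internally (inputs shifted by one).
        outputs = model(**inputs, labels=inputs["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The point of the sketch is the asymmetry: the expensive language-modeling phase is amortized across every downstream task, while the fine-tuning loop touches only a few examples.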