Greetings, fellow learners. Before we begin our titular tale of transfer learning, I have a thought-provoking question for you: what activity or skill was easy for you to pick up, but not necessarily for anyone else? For me, it was teaching, and I attribute a lot of that to the public speaking I had been doing since I was young. By the time I was in 10th grade, I was speaking in front of thousands of people, hosting events while I was still in school in India, and eventually I started hosting other events, which you can still find on my channel to this day. So public speaking is definitely a transferable skill I used for teaching on YouTube, and that's why teaching came a little more quickly to me. Please comment down below what your superpower skill is; I would love to get to know you better.

This video is divided into three passes: we start with an overview of transfer learning, follow with a detailed example of how we would actually solve a transfer learning problem, and then code the same thing out in pass three of the explanation. So stay tuned, we're going to learn a lot.

This is a neural network. Let's say we want to train this network to take in a given question and produce an answer. To do so, we construct a dataset of about one million questions along with their answers. During the training phase, we pass question-answer pairs to the model, the model updates its parameters, and we repeat this process until the model eventually learns to answer questions. During the inference phase, the model can take an unseen question and produce an answer. So this is great.

Now let's say we want to perform another related task, where we translate a given English sentence into French. To do so, we construct a dataset of, say again, one million English sentences with their French translations. We start with an untrained network, and during the training phase we pass English-French pairs to the model and update the model's parameters. During the inference phase, the model can take an unseen English sentence and produce a French translation. Great again.

But one major pain point here is that building datasets of this size, one million examples, can be very difficult. If we now wanted to solve the problem of, say, translating English to Spanish, we would need to collect one million examples again from scratch. Transfer learning helps mitigate this problem. Say we want to build an English-Spanish translator. We can first train the model on one problem, like English to French, and then fine-tune that model on the problem we actually want to solve, English to Spanish. This way, knowledge is transferred by using a model trained on one problem as the starting point for another, and we don't need as much English-to-Spanish data for the model to learn. A small code sketch of this idea follows below.

Quiz time! Have you been paying attention? Let's quiz you to find out. What is the benefit of transfer learning? A. It reduces the need for computational resources. B. It overcomes data limitations. C. It enhances model interpretability. Or D. None of the above. Please comment your answer down below and let's have a discussion. And if you think I deserve it and you love learning, please do consider hitting that like button, because it helps me a lot.
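To make that workflow concrete, here is a minimal sketch of transfer learning in PyTorch. Everything in it is a stand-in I'm assuming for illustration: a toy model and random data rather than real translation pairs. But the two-phase shape, train on plentiful data and then fine-tune the same weights on scarce data, is exactly the idea described above.

```python
import torch
import torch.nn as nn

def train(model, inputs, targets, epochs):
    """One tiny training loop, shared by both phases."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

# A shared "body" that learns general features, plus a small task head.
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),  # body: reused across tasks
    nn.Linear(64, 2),              # head: task-specific
)

# Phase 1: train on plentiful task-A data (a stand-in for English-to-French).
inputs_a = torch.randn(1000, 10)
targets_a = torch.randint(0, 2, (1000,))
train(model, inputs_a, targets_a, epochs=50)

# Phase 2: swap in a fresh head and fine-tune the same body on scarce
# task-B data (a stand-in for English-to-Spanish).
model[2] = nn.Linear(64, 2)
inputs_b = torch.randn(50, 10)
targets_b = torch.randint(0, 2, (50,))
train(model, inputs_b, targets_b, epochs=20)
```

Note the design choice: the body's weights survive into phase two, so the scarce task-B data only has to teach the new head and lightly adjust the body, not everything from scratch.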
That's going to do it for pass one, and quiz time for now, but keep paying attention, because I will be back to quiz you.

Let's illustrate transfer learning by teaching a model to perform an NLP task: question answering. In the simplest form of question answering, the network is given a context and a question, and the answer can be extracted from the context. So the output is essentially just two numbers: the first is the index of the answer's start position, and the second is the index of its end position.

For this type of network, we will use BERT with transfer learning. For details on the architecture of BERT, you can check out this video. For now, just know that BERT is a stack of the encoder part of the transformer neural network, and it is trained in two phases: a pre-training phase and a fine-tuning phase. Let's talk about each.

During the pre-training phase, we take an untrained network and train it on two problems: masked language modeling and next sentence prediction. In masked language modeling, the model takes in a masked input and determines what the masked words are. In next sentence prediction, the model takes in two sentences and determines whether the second sentence logically follows the first. Once trained on these two problems, the BERT model is said to be pre-trained. You as a user don't usually need to pre-train a model yourself; these models can be downloaded online and then fine-tuned for your use case.

During the fine-tuning phase, we take this pre-trained network and train it further on question answering. The dataset has a question plus a context, and we concatenate these together. This stream of text is then broken down into individual units called tokens and padded to a fixed length of 384 tokens. We pass these tokens into BERT (in code, we're going to use a distilled version of BERT called DistilBERT), which converts each token into a 768-dimensional embedding. When I say embedding, I mean a vector, a set of 768 numbers, that represents the meaning of a token.

Each of these 384 tokens is then mapped to a two-dimensional vector. The first number is the probability that this token is the start token of the answer, and the second number is the probability that this token is the end token of the answer. So they are numbers between 0 and 1, and we have 384 of these two-dimensional vectors. We can then take the positions of the maximum values across these columns to get the start position and end position (a short code sketch of this appears just below). There is also a little bit of post-processing to handle some edge cases. Overall, the big picture is that fine-tuning BERT requires substantially less question-answer data than training a model from scratch.

Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. Which of the following architectures can make use of transfer learning? A. BERT. B. GPT. C. Feed-forward neural networks. Or D. All of the above. Comment your answer down below and let's have a discussion. That's going to do it for quiz time for now, but keep paying attention, because I will be back to quiz you.
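Before we jump into the coding pass, here is a hedged sketch of the start/end idea at inference time, using the Hugging Face transformers library. The checkpoint name, question, and context are my own illustrative assumptions; any BERT-style model already fine-tuned for question answering exposes the same interface.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Assumed checkpoint: a DistilBERT variant already fine-tuned on SQuAD.
checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "What does transfer learning reduce the need for?"
context = "Transfer learning reuses a pre-trained model, reducing the need for task-specific data."

# The question and context are concatenated and tokenized together.
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One start score and one end score per token; the argmax picks the span.
start = outputs.start_logits.argmax(dim=-1).item()
end = outputs.end_logits.argmax(dim=-1).item()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```

Each token receives one start score and one end score, and taking the argmax of each list recovers the answer span: exactly the two numbers described above.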
In this pass, we are going to fine-tune BERT on a question-answering dataset. The notebook walks through a few steps: loading the training data; pre-processing that training data so it can be fed to the model; loading and training the model itself; some post-processing to interpret the results from the model; and finally performance evaluation. Let's go through each, starting with loading the dataset.

In this case, we're going to use SQuAD version 1, which means the answer is always present within the provided context, which simplifies the process. We're going to use DistilBERT, and the batch size is 16, which means we can pass 16 examples to the network in parallel and get 16 results in parallel. This dataset has about 87,000 training examples and 10,000 test examples, which shows we only need this many examples to fine-tune our model, instead of the millions we would potentially need if training from scratch.

Here is a record showing what the training data looks like. The important parts are the context, the question, and the answer, where we can see the text the answer corresponds to. This text is present directly within the context itself: "they consider to be unscriptural" appears right over here. Its starting position is also given over here, which is the character position of "they" in this case. Step one, loading the dataset, is complete.

Now on to step two, pre-processing the training data. The first thing we do is concatenate the question and the context, and then break that text down into individual tokens via an AutoTokenizer. You can see it here: this is the question, this is the context; we have broken them down into individual tokens and padded the result using a padding token. The padding is required because we want fixed-size numerical inputs to be passed into the model at a given time. We also want to fetch the start and end positions: the token index where the answer starts and the token index where it ends, since the answer is already present in the context. For example, 33 to 39 for this fifth example right over here, which is the same one right over here: the answer "a golden statue of the Virgin Mary" starts at the 33rd token and ends at the 39th token.

So how do we go from the raw record to this tokenized, padded version with the start and end positions of the answer? We have a function that does all of that, called prepare training features (a simplified sketch of it follows below). With that, we have loaded the dataset and we are able to pre-process our input and output data.
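Here is a simplified, hedged sketch of what a function like prepare training features does, based on the standard Hugging Face SQuAD preprocessing recipe. The tokenizer checkpoint and the sample record are my assumptions, and the real notebook function also handles edge cases this sketch skips, like contexts that overflow the 384-token limit.

```python
from transformers import AutoTokenizer

# Assumed checkpoint; the video uses a DistilBERT variant.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def prepare_train_features(example, max_length=384):
    # Concatenate question + context, truncate/pad to a fixed 384 tokens,
    # and keep the (char_start, char_end) offsets of every token.
    tokenized = tokenizer(
        example["question"],
        example["context"],
        max_length=max_length,
        truncation="only_second",     # truncate the context, never the question
        padding="max_length",         # fixed-size input, as described above
        return_offsets_mapping=True,
    )
    answer = example["answers"]
    char_start = answer["answer_start"][0]
    char_end = char_start + len(answer["text"][0])

    # Walk the offsets to find which tokens cover the answer's characters.
    start_pos = end_pos = 0
    for i, (token_start, token_end) in enumerate(tokenized["offset_mapping"]):
        if tokenized.sequence_ids()[i] != 1:  # skip question and special tokens
            continue
        if token_start <= char_start < token_end:
            start_pos = i
        if token_start < char_end <= token_end:
            end_pos = i
    tokenized["start_positions"] = start_pos
    tokenized["end_positions"] = end_pos
    return tokenized

# Hypothetical record in SQuAD-like format.
context = "Atop the Main Building's gold dome is a golden statue of the Virgin Mary."
answer_text = "a golden statue of the Virgin Mary"
example = {
    "question": "What sits on top of the Main Building?",
    "context": context,
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}
features = prepare_train_features(example)
print(features["start_positions"], features["end_positions"])
```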
Now we need to load and fine-tune our model, and this here is the architecture. We have the input tokens, the 384 of them, and each token is mapped to a 768-dimensional embedding vector. A position embedding lets the tokens encode positional information, because their position matters. LayerNorm helps stabilize training, and dropout acts as a regularizer to prevent, or at least mitigate, overfitting. Then we have the transformer itself, which consists of six transformer encoder blocks, because this is DistilBERT. Each encoder block performs a series of attention operations to better encapsulate the meaning of every single token, which is why the output for every single token is again 768-dimensional, now better encapsulating meaning. If you want to get an idea of exactly how every single component works, I highly recommend you take a look at the Transformers from Scratch playlist.

We then have 384 tokens, and each is mapped to a two-dimensional output: the probability that this token is the start index and the probability that it is the end index. So that's the model we're dealing with, and we can start the training process fairly easily.

So we have loaded the dataset, pre-processed the data, and trained the model. Now let's look at post-processing and prediction. This code over here takes a batch of examples and generates predictions. The output is a list of 384 values indicating the probability of each token being the start index, and another list of 384 values for the end index; there is also a batch dimension, because we're passing 16 examples at a time. But we really only need one position per list, so we can just take the index of the maximum value. For example, across the 16 examples in the batch: in the first example, the answer starts at token 46 and ends at token 47; in the second, it starts at token 57 and ends at token 58; in the third, it starts at token 78 and ends at token 81; and so on.

Now that we've done the dataset loading, the pre-processing, the model training, and the post-processing, we can do some proper evaluation, because we have our evaluation set of about 10,000 examples. We can measure how often the answers generated by the model exactly match the true answers, and we can also compute some other evaluation metrics (a small sketch of this appears after the summary below). Once done, you can push your fine-tuned model to the Hugging Face Hub.

Quiz time! All right, this is going to be a fun one. What is the primary advantage of using DistilBERT over BERT? A. DistilBERT is larger and more powerful. B. DistilBERT provides better interpretability. C. DistilBERT is faster and has a smaller memory footprint. Or D. DistilBERT has a higher capacity for model complexity. Comment your answer down below and let's have a discussion. And if you think I do deserve it, please do consider giving this video a like, because it helps me out a lot. That's going to do it for quiz time and pass three of the explanation.

But before we go, let's generate a summary. Transfer learning involves pre-training a model on a general task and then fine-tuning the model on a specific task, so we don't need as much task-specific data. BERT uses transfer learning by pre-training on masked language modeling (given a masked input, determine the masks) and next sentence prediction (given two sentences, determine whether the second logically follows the first), and we can then use less data to fine-tune it on other NLP tasks like question answering.
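For reference, here is a hedged sketch of that exact-match evaluation, using the SQuAD metric from the Hugging Face evaluate library. The example id and answer strings are hypothetical placeholders rather than values from the actual notebook.

```python
import evaluate

# The SQuAD metric from the `evaluate` library computes exact match and F1.
squad_metric = evaluate.load("squad")

# Hypothetical prediction/reference pair; the real loop covers ~10,000 examples.
predictions = [
    {"id": "example-0", "prediction_text": "a golden statue of the Virgin Mary"},
]
references = [
    {
        "id": "example-0",
        "answers": {"text": ["a golden statue of the Virgin Mary"], "answer_start": [38]},
    },
]

results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # e.g. {'exact_match': 100.0, 'f1': 100.0}
```

The metric reports both exact match and F1, and once you're happy with the numbers, the fine-tuned model can be pushed to the Hugging Face Hub, for example via the Trainer's push_to_hub method.

And that's all we have for today. The links to the code and all other resources will be provided in the description down below, so do check them out. If you want to know more about how BERT works, I highly suggest you check out this video right over here, and you can watch the rest of the NLP playlist as well. Thank you all so much for watching. If you do think I deserve it, please do give this video a like. Thank you, and I will see you in the next one. Bye-bye.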