Today we're going to talk about BERT. So let's jump into it.

This is the transformer neural network architecture, which was originally created to solve the problem of language translation, and it was very well received. Until that point, LSTM networks had been used to solve this problem, but they had a few problems of their own. LSTM networks are slow to train: words are passed in sequentially and generated sequentially, so it can take a significant number of time steps for the network to learn. And they're not the best at capturing the true meaning of words. Yes, even bidirectional LSTMs, because even there the left-to-right and right-to-left contexts are learned separately and then concatenated, so some of the true context is lost.

The transformer architecture addresses these concerns. First, transformers are faster, because words can be processed simultaneously. Second, the context of words is learned better, because context is learned from both directions simultaneously.

So let's see the transformer in action. Say we want to train this architecture to translate English to French. The transformer consists of two key components, an encoder and a decoder. The encoder takes in all the English words simultaneously and generates an embedding for every word simultaneously. These embeddings are vectors that encapsulate the meaning of the word; similar words have vectors with closer values. The decoder takes these embeddings from the encoder, along with the previously generated words of the translated French sentence, and uses them to generate the next French word. We keep generating the French translation one word at a time until the end of the sentence is reached.

What makes this conceptually so much more appealing than an LSTM cell is that we can see a clear separation of tasks. The encoder learns what English is, what grammar is, and, more importantly, what context is. The decoder learns how English words relate to French words. Both of these, even separately, have some underlying understanding of language, and it's because of this understanding that we can pick the architecture apart and build systems that understand language.

If we stack the decoders, we get the GPT transformer architecture. Conversely, if we stack just the encoders, we get BERT: Bidirectional Encoder Representations from Transformers, which is exactly what it is. The original transformer has language translation on lock, but we can use BERT to learn language translation, question answering, sentiment analysis, text summarization, and many more tasks. It turns out all of these problems require an understanding of language, so we can train BERT to understand language and then fine-tune it depending on the problem we want to solve.

As such, the training of BERT is done in two phases. The first phase is pre-training, where the model learns what language and context are. The second phase is fine-tuning, where the model learns, "I know language, but how do I solve this particular problem?" From here, we'll go through pre-training and fine-tuning, starting at the highest level and then going further into the details with every pass.

So let's go deeper into each phase. The goal of pre-training is to make BERT learn what language and context are. BERT learns language by training on two unsupervised tasks simultaneously: masked language modeling and next sentence prediction. For masked language modeling, BERT takes in a sentence where random words have been replaced with [MASK] tokens.
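To make that concrete, here's a minimal sketch in plain Python of how such a training example could be built. The helper below is just an illustration, not BERT's exact procedure; the paper masks roughly 15% of the tokens, so I'll use that rate here.

```python
import random

def mask_tokens(words, mask_rate=0.15, mask_token="[MASK]"):
    """Replace a random subset of words with [MASK] and remember the originals."""
    masked, labels = [], {}
    for i, word in enumerate(words):
        if random.random() < mask_rate:
            masked.append(mask_token)
            labels[i] = word          # the model's target at this position
        else:
            masked.append(word)
    return masked, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
masked_sentence, targets = mask_tokens(sentence)
print(masked_sentence)   # e.g. ['the', '[MASK]', 'brown', 'fox', ...]
print(targets)           # e.g. {1: 'quick'}
```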
The goal is to predict these masked tokens, which is kind of like fill in the blanks, and it helps BERT learn bidirectional context within a sentence. In the case of next sentence prediction, BERT takes in two sentences and determines whether the second sentence actually follows the first, which is essentially a binary classification problem. This helps BERT understand context across different sentences. Using both of these tasks together, BERT gets a good understanding of language.

Great, so that's pre-training. Now, the fine-tuning phase. We can further train BERT on very specific NLP tasks. For example, take question answering. All we need to do is replace the fully connected output layers of the network with a fresh set of output layers that can output the answer to the question we want. Then we perform supervised training using a question answering data set. It won't take long, since only the output parameters are learned from scratch; the rest of the model parameters are just slightly fine-tuned, so training time is fast. And we can do this for any NLP problem: replace the output layers and train with a task-specific data set.

Okay, so that's pass one of the explanation of pre-training and fine-tuning. Let's go on to pass two with some more details.

During pre-training, we train on masked language modeling and next sentence prediction, and in practice both of these problems are trained simultaneously. The input is a pair of sentences with some of the words masked out. Each token is a word, and we convert each of these words into embeddings using pre-trained embeddings, which gives BERT a good starting point to work with. On the output side, C is the binary output for next sentence prediction: it outputs one if sentence B follows sentence A in context and zero if it doesn't. Each of the T's is a word vector corresponding to an output of the masked language modeling problem, so the number of word vectors we input is the same as the number of word vectors we output.

In the fine-tuning phase, though, if we wanted to perform question answering, we would train the model by modifying the inputs and the output layer. We pass in the question followed by a passage containing the answer as the input, and in the output layer we output the start and the end words that bound the answer, assuming the answer lies within a single span of that text. (There's a small sketch of this idea a little further down.)

That's pass two of the explanation. Now for pass three, where we dive even further into the details. This is going to be fun.

On the input side, how do we generate these embeddings from the word token inputs? The initial embedding is constructed from three vectors. The token embeddings are the pre-trained embeddings; the main paper uses WordPiece embeddings with a vocabulary of 30,000 tokens. The segment embedding is basically the sentence number, encoded into a vector. And the position embedding is the position of the word within the input, encoded into a vector. Adding these three vectors together, we get the embedding vector that we use as input to BERT. The segment and position embeddings are required for ordering, because all of these vectors are fed into BERT simultaneously, and a language model still needs to know the order of the words and which sentence each word belongs to.
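Here's a rough PyTorch sketch of how those three embeddings could be combined. The sizes are just in the spirit of BERT-base (30,000-token vocabulary, 768-dimensional vectors), the token ids are made up, and the real model also applies extras like layer normalization, so treat this as an illustration rather than the actual implementation.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768     # sizes in the spirit of BERT-base

token_emb    = nn.Embedding(vocab_size, hidden)   # one vector per WordPiece token
segment_emb  = nn.Embedding(2, hidden)            # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)      # position of the token in the input

token_ids   = torch.tensor([[101, 7592, 103, 2088, 102]])   # hypothetical token ids, 103 = [MASK]
segment_ids = torch.tensor([[0, 0, 0, 0, 0]])                # all from "sentence A"
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)   # 0, 1, 2, ...

# The input to the encoder stack is simply the sum of the three embeddings.
x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)   # torch.Size([1, 5, 768])
```

And to make the question-answering fine-tuning from the last pass a bit more concrete, here's a similarly rough sketch of a fresh output layer that scores every token as a possible start or end of the answer span. Again, the sizes and target positions are made up; this is only the shape of the idea, not a full fine-tuning setup.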
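```python
import torch
import torch.nn as nn

hidden, seq_len = 768, 64                 # illustrative sizes

# Pretend these are BERT's output word vectors for the [question ; passage] tokens.
encoder_outputs = torch.randn(1, seq_len, hidden)

# Fresh output layer for fine-tuning: two scores per token (start logit, end logit).
span_head = nn.Linear(hidden, 2)

start_logits, end_logits = span_head(encoder_outputs).split(1, dim=-1)
start_logits = start_logits.squeeze(-1)   # shape: (1, seq_len)
end_logits   = end_logits.squeeze(-1)

# Supervised targets: positions of the first and last word of the answer (hypothetical).
start_target = torch.tensor([12])
end_target   = torch.tensor([17])

loss = nn.CrossEntropyLoss()(start_logits, start_target) + \
       nn.CrossEntropyLoss()(end_logits, end_target)
print(loss.item())
```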
Cool, the input side is piecing together pretty well. Let's go to the output side now.

The output is the binary value C and a bunch of word vectors, and for training we need to minimize a loss. Two key details to note here: all of these word vectors have the same size, and all of them are generated simultaneously. We take each word vector and pass it into a fully connected output layer with as many neurons as there are tokens in the vocabulary, so an output layer of 30,000 neurons in this case, and then apply a softmax activation. This converts each word vector into a probability distribution over the vocabulary. The label for this distribution is the one-hot encoded vector for the actual word, and we compare the two distributions and train the network using the cross-entropy loss.

Note that the network outputs a word vector for every position, even for the inputs that weren't masked at all. The loss, however, only considers the predictions for the masked words and ignores all the other words output by the network. This is done so that more focus is given to predicting the masked values correctly, which increases context awareness.

So those were the three passes explaining the pre-training and fine-tuning of BERT. Let's put it all together. We pre-train BERT with masked language modeling and next sentence prediction. For every word, we get the token embedding from the pre-trained WordPiece embeddings and add the position and segment embeddings to account for the ordering of the inputs. These are passed into BERT, which under the hood is a stack of transformer encoders, and it outputs a bunch of word vectors for masked language modeling and a binary value for next sentence prediction. The word vectors are then converted into distributions to train with the cross-entropy loss.

Once training is complete, BERT has some notion of language: it's a language model. The next step is the fine-tuning phase, where we perform supervised training depending on the task we want to solve, and this happens fast. In fact, fine-tuning BERT on SQuAD, the Stanford Question Answering Dataset, only takes about 30 minutes starting from the pre-trained language model, and it reaches around 91% performance. Of course, performance depends on how big we want BERT to be. The BERT-large model, which has 340 million parameters, can achieve much higher accuracy than the BERT-base model, which has only 110 million parameters.

There's so much more to say about the internals of BERT that I could go on forever, but for now I hope this explanation gave you a good idea of what BERT really does under the hood. For more details on the transformer neural network architecture, which is the foundation of BERT itself, click on this video. Subscribe and stay safe, a lot more content is coming your way soon, and I'll see you soon. Bye-bye!