Hey, Billy. Hi, Bobby. How are you today? Pretty good, what about yourself? Pretty good. So how's the weather today? Yeah, it's somewhere in the mid-70s. 70s? The metric system? What's that, dude? No. No, really, what's that, dude? No.

If we want to build a chatbot like this, we would typically model the problem using recurrent neural nets: feed in a sentence, output another sentence. However, training this chatbot is tough; it needs to go from knowing nothing to knowing how to converse. Word2vec saves some time here. It maps words to vectors that encode their meaning. With words as vectors, we can compute distances between them, distances that represent how different two words are in meaning. This provides a good starting point to train NLP models faster. But these vectors aren't extremely accurate. If you look at the word2vec architecture, it's just a neural network with a single hidden layer, and because of this it can only capture very surface-level representations. Language is complex: to create a good embedding, the vector needs to account for pronouns, negations, long-term dependencies, and so many other things. All of these interactions can't be captured by a mere single hidden layer of a neural network. So instead of a single layer to generate embeddings, what if we pre-trained a model to learn the basics of language and then fine-tuned this model for different NLP tasks? Knowing the fundamentals of language, it becomes easier to pick up different tasks. This is the goal that deep learning in NLP has been striving toward. But what NLP problem captures the fundamentals of language? Is it machine translation, or question answering, or chatbots?
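The "words as vectors with distances" idea can be sketched in a few lines. The 3-dimensional embeddings below are made up for illustration; real word2vec vectors have hundreds of dimensions learned from text:

```python
import math

# Hypothetical 3-d embeddings; real word2vec vectors are learned from a corpus.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Similarity of two word vectors: near 1.0 = similar meaning."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Related words end up close together, unrelated words further apart.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

Note that each word here gets exactly one vector, which is precisely the limitation discussed next: the same word in different sentences gets the same embedding.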
Well, none of those, really, because models for these problems tend to learn patterns in the data rather than the language itself. The problem that captures language best is language modeling: given part of a sentence, predict the next word. It captures contextual cues. Once we pre-train a language model, we can use transfer learning to fine-tune it on a host of other NLP tasks.

So let's talk about this language model a little more. The first major breakthrough architecture that came out was ELMo. ELMo created embeddings that preserve context: the same word can have different vectors depending on the context. At its core, it contains two bidirectional RNN layers trained for language modeling. We have a naive input vector for every word; it's passed through one direction to capture the left-to-right context, and through the other direction to capture the right-to-left context. These two outputs are concatenated to get a vector that has context from before and after the current word. ELMo has two such bidirectional RNN layers, so we get another similar context vector for every word. We then take a weighted sum of the two context vectors and the naive input to get the final word embedding. So we got context with ELMo, nice. But there are some problems with LSTM networks. First of all, they are super slow to train: input data needs to be passed in sequentially, one token after another. This kind of sequential flow doesn't make good use of today's GPUs, which are designed for parallel computation. So how do we parallelize this process? Well, in 2017, transformer neural nets were introduced.
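The language-modeling task itself is easy to state. Here is a minimal count-based sketch, a bigram model, far simpler than ELMo's recurrent networks, but it shows the objective: given the words so far, predict the next one. The tiny corpus is made up for illustration:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count which word follows which: bigram statistics from the corpus.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Given the previous word, return the most frequent next word."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat": it follows "the" most often here
```

Real language models condition on much longer contexts than one word, which is exactly what the recurrent and transformer architectures discussed here are for.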
I have an entire video on this, but I'll give you the gist. It has an encoder-decoder architecture that was designed to solve the problem of language translation. Say we want to translate English to French: the encoder takes in the English words simultaneously and spits out word vectors simultaneously. If we're determining the i-th word of the French translation, the decoder takes in the vectors from the encoder and the first i−1 French words, and spits out the next possible word in the translated French sentence. We generate the translated words one at a time until we hit the end-of-sentence token. This transformer architecture can achieve better results than LSTMs for language translation.

But remember, our goal here is to use this transformer as a pre-trained model, kind of like we did with ELMo. We need a way to train a language model with this, a model that predicts the next word given the previous words in a sentence. To this, OpenAI said: just get rid of the encoders and stack the decoders. Done. Doing this, we get a deep network where we can pass in our input words in parallel and have the decoder spit out the next word. The core of each decoder is a masked attention unit. So what is that? Attention allows the network to determine how much attention to pay to every word of the input sentence while generating the next word. It's called masked attention because, during training, we aren't supposed to look at the words that come after the current position, and hence we mask them, paying no attention to them while generating the next word.
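The masking step can be sketched in plain Python. This is a toy over raw scalar attention scores; real transformers compute those scores from learned query/key/value matrices, which are omitted here:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, position):
    """Apply the causal mask: positions after `position` get a score of
    -infinity, so softmax assigns them exactly zero attention weight."""
    masked = [s if i <= position else float("-inf")
              for i, s in enumerate(scores)]
    return softmax(masked)

# While generating word 2 of a 5-word sentence, words 3 and 4 are masked.
weights = masked_attention_weights([1.0, 2.0, 0.5, 3.0, 1.5], position=2)
print(weights)
```

Setting masked scores to negative infinity before the softmax, rather than zeroing the weights afterwards, keeps the remaining weights a proper probability distribution.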
That would be cheating otherwise. And so we have a language model with the OpenAI transformer. But we have another problem: because of the masked attention, word embeddings are constructed using only the context before a word, whereas in reality the context of a word comes from both the words before it and the words that follow. And even though ELMo was a bidirectional recurrent neural network, the embeddings using context before and after were determined separately and then concatenated. We would like an architecture that incorporates the speed of the OpenAI transformer, absorbs context from both directions like ELMo, and, additionally, absorbs context from both directions simultaneously, unlike either of those two.

This is where our friend BERT comes in. BERT creates embeddings by pre-training on two unsupervised tasks. The first is masked language modeling: given some sentences, randomly mask some of the words; BERT should learn what these masked words are, and this allows it to learn bidirectional context. The other task BERT needs to solve is next sentence prediction: given two sentences A and B, it needs to determine whether sentence B actually comes after sentence A or not. This allows BERT to understand the relationship between two sentences, something that isn't captured by language modeling. So this is fantastic, and BERT was the state of the art, until now. There are some problems with BERT too. We determine the vector of a mask considering the words that come before and after it, but there are cases where the masks can be highly dependent on the other masks too, especially if they are the main focus of the sentence. We may end up with syntactically correct English that doesn't make much sense, and this can lead to low-quality embeddings. XLNet tries to address this. To better understand what's going on,
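The masked-language-modeling objective can be sketched as a data-preparation step. The `[MASK]` token and the roughly 15% masking rate follow the BERT paper; everything else below is a simplified illustration (the paper additionally replaces some selected tokens with random words or leaves them unchanged, which is skipped here):

```python
import random

MASK_RATE = 0.15  # BERT masks about 15% of input tokens

def mask_tokens(tokens, rng):
    """Replace a random subset of tokens with [MASK]; return both the
    corrupted input and the labels the model must recover."""
    masked_input, labels = [], []
    for token in tokens:
        if rng.random() < MASK_RATE:
            masked_input.append("[MASK]")
            labels.append(token)        # the model must predict this word
        else:
            masked_input.append(token)
            labels.append(None)         # nothing to predict at this position
    return masked_input, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, random.Random(0))
print(masked)
```

Because the model sees the whole corrupted sentence at once, predicting each `[MASK]` uses context from both sides, which is exactly the bidirectionality the transcript describes.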
I'm gonna pull up some graphics. When we build a language model with a transformer neural net, we're effectively predicting one word given the previous words; there is no backward context. Note that the light purple shade here means we're not using that input because it's masked. BERT's modification with masks allows us to look backwards and forwards, and we determine the masked words simultaneously, but this leads to the problem of making predictions on top of predictions, and hence we get low-quality vector embeddings. Instead, let's use a permutation language model. It generates words one at a time like the traditional language model, but it can also look forwards and backwards like BERT while generating every word. XLNet uses this permutation language model, and I hope you can see how this leads to much higher-quality embeddings. I'll link you to the blog post I got these graphics from; it dissects XLNet with an easy-to-understand explanation, so take a look at that, it's a pretty good read.

Now that you have some basic knowledge, you can try coding this out. This is Hugging Face. They have risen in popularity for their open-source library on transformers. It is very well documented, and they have guides for the different transformer neural net architectures. Also, check out another open-source library called spaCy. It's used for coding general NLP tasks, and it has an API that connects to Hugging Face transformers, so you can use them out of the box for other tasks. This too is pretty well documented.
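The permutation idea can be sketched as follows. This toy just shows, for a sampled factorization order, which positions each prediction is allowed to attend to; it illustrates the objective, not XLNet's actual two-stream attention implementation:

```python
import random

def permutation_contexts(num_words, rng):
    """Sample a random factorization order over word positions and record,
    for each position, which positions it may attend to when predicted
    (those that come earlier in the sampled order)."""
    order = list(range(num_words))
    rng.shuffle(order)
    contexts = {}
    for step, position in enumerate(order):
        contexts[position] = sorted(order[:step])
    return order, contexts

order, contexts = permutation_contexts(5, random.Random(42))
# Words are still generated one at a time (no predictions on top of
# predictions), yet the visible context for a given position can include
# words both before and after it in the actual sentence.
print(order, contexts)
```

Averaged over many sampled orders, every position learns to use context from both directions, which is how the permutation objective combines the left-to-right model's one-at-a-time generation with BERT-style bidirectionality.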
They support TensorFlow 2 and PyTorch as of the making of this video, so it's worth checking out. Hope this video helped you get better insight into the current state of NLP. There are a bunch of transformer architectures out there for learning embeddings; each one addresses a flaw in its predecessors' performance, leading to the evolution of the field. This is just the tip of the iceberg, and I'll make more videos on this soon. For now, be satisfied with the resources in the description, and I'll see y'all later. Bye-bye.