Recurrent neural nets. They are feed-forward neural networks rolled out over time. As such, they deal with sequence data, where the input has some defined ordering. This gives rise to several types of architectures.

The first is vector-to-sequence models. These neural nets take in a fixed-size vector as input and output a sequence of any length. In image captioning, for example, the input can be a vector representation of an image, and the output sequence is a sentence that describes the image. The second type is sequence-to-vector models. These neural networks take in a sequence as input and spit out a fixed-size vector. In sentiment analysis, a movie review is the input, and the output is a fixed-size vector indicating how good or bad this person thought the movie was. Sequence-to-sequence models are the most popular variant; these neural networks take in a sequence as input and output another sequence. In language translation, for example, the input could be a sentence in Spanish, and the output is the translation in English.

Do you have some time-series data to model? Well, RNNs would be the go-to. However, RNNs have some problems. RNNs are slow. So slow that we use a truncated version of backpropagation through time to train them, and even that's too hardware-intensive. And also, they can't deal with long sequences very well. We get gradients that vanish or explode if the sequence is too long.

In come LSTM networks, introduced in 1997, which replace plain recurrent neurons with a long short-term memory cell. This cell has a branch that allows past information to skip a lot of the processing of the current cell and move on to the next. This allows the memory to be retained over longer sequences. Now, to that second point, we seem to be able to deal with longer sequences well. Or are we? Well, kind of. Probably on the order of hundreds of words, but not thousands. And to the first point, normal RNNs are slow, but LSTMs are even slower. They're more complex.
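To see why RNNs are inherently slow, here's a minimal numpy sketch of a vanilla RNN cell rolled out over a toy sequence (random weights, not a trained model): each hidden state depends on the previous one, so the loop over time steps cannot be parallelized.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Roll a vanilla RNN cell over a sequence, one step at a time.

    Each hidden state depends on the previous hidden state, which is
    why the time loop is strictly sequential.
    """
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in inputs:  # cannot parallelize across time steps
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return states

# Toy dimensions: 3-dim inputs, 4-dim hidden state.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3)) * 0.1
W_hh = rng.standard_normal((4, 4)) * 0.1
b_h = np.zeros(4)
seq = [rng.standard_normal(3) for _ in range(5)]
states = rnn_forward(seq, W_xh, W_hh, b_h)
```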
For these RNN and LSTM networks, input data needs to be passed sequentially, or serially, one element after the other. We need the hidden state from the previous step to make any operations on the current step. Such sequential flow does not make good use of today's GPUs, which are designed for parallel computation. So, question: how can we use parallelization for sequential data?

In 2017, the transformer neural network architecture was introduced. The network employs an encoder-decoder architecture much like recurrent neural nets. The difference is that the input sequence can be passed in parallel. Consider translating a sentence from English to French. I'll use this as a running example throughout the video. With an RNN encoder, we pass in the English sentence one word after the other. The current word's hidden state depends on the previous word's hidden state, and the word representations are generated one time step at a time. With a transformer encoder, on the other hand, there is no concept of a time step for the input. We pass in all the words of the sentence simultaneously and determine the word embeddings simultaneously.

So how does it do this? Let's pick apart the transformer architecture. I'll make multiple passes over the explanation. The first pass will be a high-level overview, and the next rounds will get into more details.

Let's start off with input embeddings. Computers don't understand words. They get numbers, vectors, and matrices. The idea is to map every word to a point in space where words with similar meanings are physically closer to each other. The space in which they live is called an embedding space. We could pre-train this embedding space to save time, or even just use an already pre-trained one. But this embedding space maps a word to the same vector everywhere, while the same word in different sentences may play different roles. This is where positional encoders come in. A positional encoding is a vector that carries information about a word's position in the sentence and the distances between words.
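The sinusoidal positional encoding used in the original paper can be sketched in a few lines of numpy. Each position gets a vector of sines and cosines at different frequencies, following PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'.

    PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims get cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
```

These vectors are simply added to the word embeddings, so the same word at different positions ends up with a different input vector.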
The original paper uses sine and cosine functions to generate this vector, but it could be any reasonable function. After passing the English sentence through the input embedding and applying the positional encoding, we get word vectors that have positional information, that is, context. Nice. We pass this into the encoder block, where it goes through a multi-headed attention layer and a feed-forward layer. Okay, one at a time.

Attention. It involves answering: what part of the input should I focus on? If we are translating from English to French, and we are doing self-attention, that is, attention with respect to oneself, the question we want to answer is: how relevant is the i-th word in the English sentence to the other words in the same English sentence? This is represented in the i-th attention vector, and it is computed in the attention block. For every word, we generate an attention vector that captures the contextual relationships between words in the sentence.

So that's great. The other important unit is the feed-forward net. This is just a simple feed-forward neural network that is applied to every one of the attention vectors, one position at a time. These feed-forward nets are used in practice to transform the attention vectors into a form that is digestible by the next encoder block or the decoder block.

Now, that's the high-level overview of the encoder components. Let's talk about the decoder. During the training phase for English to French, we feed the target French sentence to the decoder. But remember, computers don't get language. They get numbers, vectors, and matrices. So we process it using an embedding layer to get the vector form of each word, and then we add a positional vector to get the notion of the word's context in the sentence. Finally, we pass this vector into a decoder block that has three main components, two of which are similar to those in the encoder block.
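The position-wise feed-forward net mentioned above can be sketched like this: the same two-layer network (toy random weights here) is applied to every position independently, which is exactly what makes it parallelizable across words.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply the same two-layer feed-forward net to every position.

    x has shape (seq_len, d_model). The weights are shared across
    positions, so all rows are processed independently, in parallel.
    """
    return relu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
d_model, d_ff, seq_len = 8, 32, 5
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.standard_normal((seq_len, d_model))  # 5 attention vectors
out = position_wise_ffn(x, W1, b1, W2, b2)
```

Note that the output has the same shape as the input, so it can feed straight into the next encoder or decoder block.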
The self-attention block generates attention vectors for every word in the French sentence, representing how much each word is related to every other word in the same sentence. These attention vectors, together with the vectors from the encoder, are passed into another attention block. Let's call this the encoder-decoder attention block, since it receives one vector for every word in the English and French sentences. This attention block determines how related each word vector is to the others, and this is where the main English-to-French word mapping happens. The output of this block is an attention vector for every word in the English and French sentences, each vector representing the relationships with other words in both languages.

Next, we pass each attention vector to a feed-forward unit. This makes the output vectors more digestible by either the next decoder block or the linear layer. Now, the linear layer is, surprise, surprise, another fully connected layer. It's used to expand the dimensions to the size of the French vocabulary. The softmax layer then transforms that into a probability distribution, which is now human-interpretable, and the final word is the one corresponding to the highest probability. Overall, the decoder predicts the next word, and we execute this over multiple time steps until the end-of-sentence token is generated.

That's our first pass over the entire network architecture for transformers. But let's go over it again, this time introducing even more details, going deeper. An input English sentence is converted into an embedding to represent meaning. We add a positional vector to get the context of the word in the sentence. Our attention block computes the attention vectors for each word. The only problem here is that an attention vector computed this way may not be too strong. For every word, the attention vector may weight its relation with itself much higher. That's true, but useless.
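The predict-until-end-of-sentence loop above is just greedy decoding. Here's a runnable sketch with a tiny made-up French vocabulary and a stubbed-out `next_word_probs` standing in for the whole decoder + linear + softmax stack (a real transformer would attend over the encoder output and the words generated so far):

```python
import numpy as np

# Toy French vocabulary; '<eos>' is the end-of-sentence token.
vocab = ["<eos>", "le", "chat", "est", "noir"]

def next_word_probs(generated):
    """Hypothetical stand-in for the decoder + linear + softmax stack.

    Hard-codes a tiny script so the greedy loop below is runnable;
    it is NOT the real model, just a placeholder.
    """
    script = {0: "le", 1: "chat", 2: "<eos>"}
    probs = np.full(len(vocab), 0.01)
    probs[vocab.index(script[len(generated)])] = 1.0
    return probs / probs.sum()

# Greedy decoding: pick the highest-probability word each time step
# until the end-of-sentence token comes out.
generated = []
while True:
    word = vocab[int(np.argmax(next_word_probs(generated)))]
    if word == "<eos>":
        break
    generated.append(word)
# generated is now ["le", "chat"]
```

In practice beam search is often used instead of pure argmax, but the stop-on-`<eos>` structure is the same.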
We are more interested in interactions with different words, and so we compute something like eight such attention vectors per word and combine them (roughly a weighted average) into the final attention vector for every word. Since we use multiple attention heads, we call it the multi-head attention block.

The attention vectors are then passed through a feed-forward net, one vector at a time. The cool thing is that these feed-forward passes are independent of each other, so we can use some beautiful parallelization here. Because of this, we can pass all our words at the same time into the encoder block, and the output will be a set of encoded vectors, one for every word.

Now the decoder. We first obtain the embeddings of the French words to encode meaning, then add the positional values to retain context. They are then passed to the first attention block. The paper calls this the masked attention block. Why is that? It's because while generating the next French word, we can use all the words from the English sentence, but only the previous words of the French sentence. If we let it use all the words in the French sentence, there would be no learning; it could just copy out the next word. So while performing parallelization with matrix operations, we apply a mask that hides the words appearing later, in practice by setting their scores to negative infinity before the softmax, so their attention weights come out as zero and the attention network can't use them. The next attention block, the encoder-decoder attention block, generates similar attention vectors for every English and French word. These are passed into the feed-forward layer, the linear layer, and the softmax layer to predict the next word.

That's pass two over the architecture explanation. I hope you're picking up more and more details here. Now for the next pass, where we go even deeper. How exactly do these multi-head attention networks look? A single attention head looks like this: Q, K, and V are abstract vectors that extract different components of an input word.
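The masked attention just described can be sketched as scaled dot-product attention with a causal (look-ahead) mask, using toy random Q, K, V matrices. Scores for future positions are set to negative infinity before the softmax, so their attention weights are exactly zero:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Scores for positions later in the sequence are set to -inf
    BEFORE the softmax, so their weights come out as exactly zero.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal = positions in the future.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(2)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = masked_attention(Q, K, V)
```

Dropping the mask gives the plain self-attention used in the encoder; the formula is otherwise identical.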
We have Q, K, and V vectors for every single word, and we use them to compute the attention vector for every word with the scaled dot-product attention formula. For multi-headed attention, we have multiple weight matrices W_Q, W_K, and W_V, so we get multiple attention vectors Z for every word. However, the next layer is only expecting one attention vector per word, so we use another weight matrix, W_Z, to project back down and make sure the output is still one attention vector per word.

Additionally, after every layer, we apply some form of normalization. Typically, we would apply batch normalization. This smooths out the loss surface, making it easier to optimize while using larger learning rates. That's the TL;DR of what it does. But transformers actually use something called layer normalization, which normalizes across the features of each sample instead of across the batch. It's better for stabilization.

If you are interested in dabbling in transformer code, TensorFlow has a step-by-step tutorial that can get you up to speed. Transformer neural nets have largely replaced LSTM nets for sequence-to-vector, sequence-to-sequence, and vector-to-sequence problems. Google, for example, created BERT, which uses transformers to pre-train models for common NLP tasks. Read that blog. It's good. However, there was another paper called Pervasive Attention that could be even better than transformers for sequence-to-sequence models. Although transformers may be better suited to a wider variety of problems, it's still a very interesting read. I'll link it in the description below with other resources, so check that out.

Hope this helped get you up to speed with transformer neural nets. If you liked the video, hit that like button. Subscribe to stay up to date with some deep learning and machine learning knowledge, and I will see you guys in the next one. Bye-bye.