I'm going to talk about sequence-to-sequence learning and how it has advanced. This talk starts from the beginning of sequence-to-sequence learning and shows how newer models have changed the way it is done.

Before I begin, a brief introduction about myself. I have a background in finance and banking. I started as a developer, worked as a data scientist for some time, and then started leading analytics teams. I'm currently at Accenture, where I head a center of excellence for AI and advanced analytics focusing on the finance and risk domain, and I also advise some startups on their AI strategy. On the work side I focus on credit risk and financial crime and how to use AI there. Outside of work I read about deep learning, I'm learning about blockchain and AGI, and I compete on Kaggle occasionally.

So let's get to the talk. This is a brief overview of a sequence-to-sequence model. The model has basically two parts: an encoder and a decoder. You feed the input sequence into the encoder, it consumes it one time step at a time and builds a representation, which is then fed into the decoder along with a start token, in this case "GO", and the decoder produces the output. You train with a lot of input and output pairs, and the model learns these contexts and representations. This is the starting era of sequence-to-sequence models.

What sits on the left side and the right side is essentially an RNN, a recurrent neural network. It's the same idea in a simpler representation: input goes in and output comes out, but there is a loop back, and if you unroll it along the time axis you can see that apart from producing an output at the current step, it also pushes some context forward into the next step, which is used for learning there. This RNN can be used at both ends of a sequence-to-sequence model, on the encoder and on the decoder side. It can carry over short-term context. Take the sentence "the clouds are in the sky": the RNN can hold knowledge of something that appeared in the recent past, but where it fails is longer context; with a bigger sentence it cannot remember much of the earlier history. If you want more background, the link at the bottom is where I took this diagram from; it's a very good blog for learning about sequence-to-sequence models and RNNs.

So what's the problem with this setup, why can't it retain long-term memory? When you do backpropagation, the gradients start to diminish as they are passed back, say from S3 to S2: if small values keep getting multiplied together along the way, the product shrinks toward zero. This is the vanishing gradient problem, which essentially means you cannot go far back in time and pick up context from there.
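To make the vanishing-gradient point concrete, here is a tiny NumPy sketch of my own (not from the talk): we unroll a toy RNN and push a gradient backwards through time. The hidden size, the matrix W_h, and the sequence length are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)
T = 50                                 # number of time steps to unroll
W_h = np.random.randn(8, 8) * 0.3      # recurrent weight matrix for a hidden size of 8

# forward pass: h_t = tanh(W_h @ h_{t-1} + x_t), with random noise standing in for x_t
h = [np.zeros(8)]
for t in range(T):
    h.append(np.tanh(W_h @ h[-1] + 0.1 * np.random.randn(8)))

# backward pass: push a unit gradient from the last step back through time
grad = np.ones(8)
for t in range(T, 0, -1):
    # chain rule through one step: multiply by tanh'(.) and by W_h^T
    grad = W_h.T @ (grad * (1.0 - h[t] ** 2))
    if t % 10 == 0:
        print(f"step {t:3d}  gradient norm = {np.linalg.norm(grad):.2e}")
# the norm keeps shrinking, so the earliest time steps receive almost no learning signal
```

Running it, the gradient norm drops by orders of magnitude as you walk back through the steps, which is why a plain RNN struggles to learn dependencies far in the past.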
So with this problem, what do you do now? Before that, here is another simple way to represent the RNN; I've included this slide just to draw the comparison with the next one. Essentially you have the input and the state coming from the previous step, and a simple function, in this case a tanh, combines them and generates the output of the current step.

The solution to this was the LSTM, one of the family of gated recurrent networks (the GRU is a related, simplified variant). In the LSTM you have this top line, the cell-state line. It's a bit like the ResNet idea, where you pass the context straight through across the steps so that memory can persist longer, and then you have these various gates. There is a forget gate, which at each step decides what should be dropped from memory; an input gate, which decides what should start being remembered at this step; and then you compute the output, and all the while the context vector passes straight through. What this gives you is the opportunity to capture longer-term dependencies: it can take input from sequences on the order of hundreds of time steps, but it cannot go much beyond that, so there are inherent limitations here as well. Again, the link below is a good place to read and understand more about them.

LSTMs became very popular and started getting used in many applications: speech-to-text, conversational agents (where, as we just saw, the generative part uses this kind of RNN- or LSTM-based sequence-to-sequence model), neural machine translation, image-to-text, video captioning. Those are some of the applications.

So what are the challenges? RNNs operate either left to right or right to left, which essentially means they are not good candidates for parallelization on GPUs: you have to go through each time step in order, so they don't parallelize well. What can we do then? CNNs come to the rescue. The next evolution was to try CNNs, which are much more parallelizable on GPUs.

Before I jump into the details of models that use CNNs, let's talk about another concept: attention. In a sequence-to-sequence model the left part is the encoder and the right part is the decoder. But the way humans look at a sentence when answering is to take a glance at the whole thing, rather than re-reading it step by step. That's the conceptual idea behind attention. You compute attention weights from each encoder state and combine them into a context vector, which can then be referenced at each stage of the decoder to look back at the full input sentence and pick out the parts that are most relevant.
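As a rough sketch of that glance idea (my own illustration; the function and variable names are made up for this example): score each encoder hidden state against the current decoder state, turn the scores into attention weights with a softmax, and take the weighted sum as the context vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Dot-product attention for one decoder step.

    decoder_state:  shape (d,)   the current decoder hidden state
    encoder_states: shape (T, d) one hidden state per input time step
    """
    scores = encoder_states @ decoder_state   # one score per input position
    weights = softmax(scores)                 # attention weights, sum to 1
    context = weights @ encoder_states        # weighted "glance" over the whole input
    return context, weights

# toy usage: an input of 6 time steps with hidden size 4
np.random.seed(1)
enc = np.random.randn(6, 4)
dec = np.random.randn(4)
ctx, w = attention_context(dec, enc)
print("attention weights:", np.round(w, 2))
```

The decoder repeats this at every output step, so each step gets its own view of which input positions matter most.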
Now let's move to the convolutional sequence-to-sequence models. This is a newer set of models. You have the encoder at the top of the figure and the decoder at the bottom left, and if you look, the encoder and decoder both feed into the attention module, which is essentially a dot product. Then, as in a normal sequence-to-sequence model, the encoder values come through, get weighted by the attention parameters, and the decoder at the bottom can glance through those values, pick up what is important, and produce the output from that. The triangles you see here are convolutional layers, which are computationally more efficient: rather than simple LSTMs, we are using convolutional layers, so the model parallelizes even better. The diagram is from the paper linked here, which has a lot more detail if you want to look at it. This is the convolution-based encoder-decoder applied to neural machine translation, done at Facebook. It shows how these models work: while decoding the output they glance at the entire input and use the attention mechanism to pick out what matters, but they are convolution-based, so they are also more efficient to compute.

Then there is the temporal convolutional network (TCN), another set of models, another advancement. Here you see this residual block: a unit of work containing the convolutional layers and the non-linear layers, plus an optional bypass connection. You can stack a number of these layers, and each level has a dilation factor, which determines how many positions are skipped at that level. So basically you are stacking these convolutional layers on top of each other and using them to extract more meaning from the data. What's the advantage here? Number one, they are convolution-based, so they are more computationally efficient. Number two, they can look back substantially further, a wider span of memory, even more than LSTMs.

This is the example of WaveNet. Although WaveNet is not an encoder-decoder, it is a generative model, and it essentially used dilated convolutions; this slide is roughly how it applied the same kind of architecture for training.

While all this was happening, Google wrote a very interesting paper called Attention Is All You Need. This is, again, a next step in the encoder-decoder paradigm, and there are a couple of parts to it. First, look at the left side, the scaled dot-product attention. You have a query vector, a key vector, and a value vector: the keys and values are what you have encoded, and then you query against them. There are the mathematical functions, a scaling factor, and a masking step. This scaled dot-product attention is then wrapped into multi-head attention, where you run multiple such attention modules in parallel. In the overall architecture, the left part is the encoder and the right part is the decoder, and "Nx" means the block is repeated many times, so there can be six or eight of these layers stacked on top of each other. If you compare this with the earlier LSTM model, which processed a sequential stream of inputs, this is a very different way of looking at the same problem.
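Here is a minimal sketch, written by me to mirror the formula softmax(QK^T / sqrt(d_k)) V just described; the shapes and the mask argument are illustrative. In multi-head attention, several of these run in parallel on linearly projected copies of Q, K, and V.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V

    Q: (T_q, d_k) queries, K: (T_k, d_k) keys, V: (T_k, d_v) values.
    mask: optional boolean array of shape (T_q, T_k); True marks positions
          that must not be attended to (they get "minus infinity" before softmax).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # each query position gets a mix of values

# toy usage: self-attention over 5 positions with a causal (look-left-only) mask
np.random.seed(2)
x = np.random.randn(5, 8)
causal_mask = np.triu(np.ones((5, 5), dtype=bool), k=1)   # block attention to the future
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(out.shape)   # (5, 8)
```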
So I'll go into the parts of this now. On the left side, in the encoder, you have something called self-attention. Because there are a number of layers stacked here, each layer takes its queries, keys, and values from the output of the previous layer, so all three can attend over everything the previous layer produced. It can look at all the representations from the layer below and build a new representation, which is then passed to the feed-forward network, and from there we go to the decoder part of the network.

On the right side, in the decoder, there is a similar self-attention: again the three inputs, query, key, and value, come from within the decoder. The special thing here is the masking, because you need to prevent the leftward flow of information, essentially to preserve the autoregressive property. So anything that should not be visible is masked out with minus infinity (before the softmax) to make sure those values don't get through.

The next part, higher up in the decoder, is the encoder-decoder attention. The key thing here is that the keys and values come from the encoder, which is exactly what you want: the encoder's keys and values come across, and the decoder queries against them, so the decoder can now see over whatever representation the input side has created.

Since we feed in all the positions in one go, you also need some way to encode position, so a positional encoding is added at both the encoder and decoder inputs. It can be a learned function or a fixed one, like the sine-based function used here. This whole model of attention is called the Transformer.

After this there were a lot of applications of Transformers, which led to various other advancements and papers. I think Madin is going to cover this in a bit more detail, but this is a recent paper from OpenAI on improving language understanding. Essentially they took the decoder part of the model, stacked 12 layers of it, and trained it in an unsupervised manner on a lot of text so that it learns a representation of that text. Think of it a bit like word2vec: a lot of unsupervised training that lets the model learn some kind of internal representation. The second part is where it becomes supervised: you put a linear layer on top and give it labeled data so it can learn task-specific features, whether that's a comparison task, text classification, or a similarity task. Again, the code and details are there if you want to refer to them.
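Coming back to the positional encoding mentioned a moment ago, this is a small sketch of my own of the fixed sine/cosine variant; the formula follows the one in Attention Is All You Need, while the sizes in the example are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# toy usage: add position information to 10 token embeddings of size 16,
# since self-attention by itself has no notion of word order
embeddings = np.random.randn(10, 16)
inputs_with_positions = embeddings + sinusoidal_positional_encoding(10, 16)
```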
Then we go on to the next thing: you can also have hierarchical attention, attention on top of attention, which essentially lets you go even broader with the encoder-decoder idea. One example of this kind of model is where you think of words forming sentences and sentences forming paragraphs, and you use that structure to do classification. The words go into a word encoder (this encoder could internally be GRUs as well), and then there is an attention layer, shown in green, which lets each sentence pay attention to which words matter, what special things it needs to capture from those words. Then there is a sentence encoder, with a second attention layer sitting on top of it, which finds the important sentences within the paragraph and combines everything into a document vector, which can then be used for classification or something else.

So I think I've covered the main advancements I wanted to. Any questions, or anything you'd like to discuss?

[Audience question, partly inaudible.]

Basically, the idea I was trying to show is this journey of evolution. An RNN doesn't scale well on GPUs: you can't really use that kind of accelerated hardware to train these networks faster, so training takes time. CNNs are much more parallelizable, which means you can take advantage of that hardware to train such networks for tasks like neural machine translation, where you have billions of words and you want to do the translation in a much faster and more computationally efficient way.
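To make that last answer concrete, here is a rough illustration of my own (names and sizes are arbitrary): the RNN update has to run as a loop because each step needs the previous hidden state, whereas a convolution over the sequence reduces to one big matrix multiply over all positions, which is exactly the kind of work a GPU does well.

```python
import numpy as np

T, d = 1000, 64
x = np.random.randn(T, d)                   # one vector per time step

# RNN: inherently sequential, step t cannot start before step t-1 has finished
W_x = np.random.randn(d, d) * 0.1
W_h = np.random.randn(d, d) * 0.1
h = np.zeros(d)
for t in range(T):                          # this loop cannot be spread across time steps
    h = np.tanh(x[t] @ W_x + h @ W_h)

# Convolution with kernel width 3: every output position is independent of the others,
# so after laying out the input windows, all T positions are one big matrix multiply.
W_conv = np.random.randn(3 * d, d) * 0.1
padded = np.vstack([np.zeros((2, d)), x])                              # left-pad (causal)
windows = np.stack([padded[t:t + 3].reshape(-1) for t in range(T)])    # (T, 3*d) data layout
conv_out = np.tanh(windows @ W_conv)        # one parallel operation over the whole sequence
```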